The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market.

Each app (row) has values for category, rating, size, and more. Another dataset contains customer reviews of the Android apps.

Explore and analyze the data to discover key factors responsible for app engagement and success.

📱 Hi everybody!

In this notebook, I analyze Google Play Store data using Python. This is my first data analysis study.

Google Play Store apps and reviews

Mobile apps are everywhere. They are easy to create and can be lucrative. Because of these two factors, more and more apps are being developed. In this notebook, we will do a comprehensive analysis of the Android app market by comparing over ten thousand apps in Google Play across different categories. We'll look for insights in the data to devise strategies to drive growth and retention.

Let's take a look at the data, which consists of two files:

  • playstore data.csv: contains all the details of the applications on Google Play. There are 13 features that describe a given app.
  • user_reviews.csv: contains 100 reviews for each app, most helpful first. The text in each review has been pre-processed and attributed with three new features: Sentiment (Positive, Negative or Neutral), Sentiment Polarity and Sentiment Subjectivity.

Before jumping into the data provided, let me first list the questions we want to answer and explain what exploratory data analysis (EDA) is.

Problem Statements

  1. What are the top categories on the Play Store?
  2. Are the majority of the apps paid or free?
  3. How important is the rating of an application?
  4. Which audience category should an app target?
  5. Which category has the most installations?
  6. How does the count of apps vary by genre?
  7. How does the last update affect the rating?
  8. How are ratings affected when the app is a paid one?
  9. How are reviews and ratings correlated?
  10. Let us discuss sentiment subjectivity.
  11. Are subjectivity and polarity proportional to each other?
  12. What is the percentage of each review sentiment?
  13. How does sentiment polarity vary between paid and free apps?
  14. How does Content Rating affect an app?
  15. Does the Last Updated date have an effect on the rating?
  16. Distribution of app updates over the years.
  17. Distribution of paid and free app updates over the months.

What is Exploratory Data Analysis?

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets for patterns and anomalies (outliers), to form hypotheses based on their understanding of the data, and to summarize its main characteristics, often employing data visualization methods. It is an important step in any data analysis or data science project. It helps determine how best to manipulate data sources to get the answers you need.

EDA involves generating summary statistics for the numerical data in the dataset and creating various graphical representations to understand the data better.

The following are the various steps involved in the EDA process:

  1. Problem Statement - We brainstorm and understand the given dataset. We study the attributes present in it and reason about their meaning and importance for this problem.
  2. Hypothesis - After studying the attributes in the dataset, we develop some basic hypotheses that we can test against the data.
  3. Univariate Analysis - This is the simplest form of data analysis. We pick a single attribute and study it in depth. It does not deal with any correlation; its major purpose is to describe. It takes the data, summarizes it, and finds patterns in it.
  4. Bivariate Analysis - This analysis concerns the relationship between two attributes. We try to understand how the attributes depend on each other.
  5. Multivariate Analysis - This is done when more than two variables have to be analyzed simultaneously.
  6. Data Cleaning - We clean the dataset and handle the missing data, outliers, and categorical variables.
  7. Testing Hypothesis - We check whether our data meets the assumptions required by most multivariate techniques.

▶ Exploring Play Store data:

In [342]:
#import library
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np # linear algebra
import matplotlib.pyplot as plt
import seaborn as sns  # visualization tool
from datetime import datetime
# plotly
import plotly 
plotly.offline.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import warnings
#sns.set(font_scale=1.5)
warnings.filterwarnings("ignore")

Let's explore the Play Store DataFrame

In [343]:
# loading csv File
ps_df=pd.read_csv(r"E:\0001Almabetter\2.numerical python programming\project-eda-numeric-python\Play Store Data.csv")
In [344]:
# Display the Play Store App data head

ps_df.head()
Out[344]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0 Teen Art & Design June 8, 2018 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8M 100,000+ Free 0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up
In [345]:
ps_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  object 
 4   Size            10841 non-null  object 
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content Rating  10840 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last Updated    10841 non-null  object 
 11  Current Ver     10833 non-null  object 
 12  Android Ver     10838 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.1+ MB
In [346]:
# Listing the column names in the given dataset
print(ps_df.columns)
Index(['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
       'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
       'Android Ver'],
      dtype='object')
In [347]:
ps_df.shape
Out[347]:
(10841, 13)
In [348]:
ps_df.dtypes
Out[348]:
App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object
In [349]:
ps_df.describe()
Out[349]:
Rating
count 9367.000000
mean 4.193338
std 0.537431
min 1.000000
25% 4.000000
50% 4.300000
75% 4.500000
max 19.000000

Let us first define what information the columns contain based on our inspection.

The Play Store DataFrame (ps_df) has 10841 rows and 13 columns. The 13 columns are described below:

  1. App - The name of the application, optionally with a short description.
  2. Category - The category the app belongs to.
  3. Rating - The average rating the app received from its users.
  4. Reviews - The total number of users who have reviewed the application.
  5. Size - The space the application occupies on the mobile phone.
  6. Installs - The total number of installs/downloads of the application.
  7. Type - Whether the app is free to use or paid.
  8. Price - The price payable to install the app. For free apps, the price is zero.
  9. Content Rating - The audience the app is suitable for (for example Everyone, Teen, or Mature 17+).
  10. Genres - The other categories (genres) to which an application can belong.
  11. Last Updated - When the application was last updated.
  12. Current Ver - The current version of the application.
  13. Android Ver - The Android version(s) on which the application is supported.

Cleaning of the data¶

The three features that we will be working with most frequently from here on are Installs, Size, and Price. A careful look at the dataset reveals that these columns need cleaning before they can be consumed by the code we'll write later. Specifically, the presence of special characters (, $ +) and letters (M k) in the Installs, Size, and Price columns makes their conversion to a numeric data type difficult. Let's clean them by removing these characters and converting each column to a numeric type.
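
As a rough illustration of the cleaning described above, the sketch below strips the special characters from sample Installs and Size strings in plain Python. The helper names clean_installs and size_to_mb are hypothetical (they are not part of this notebook), and treating 1 MB as 1024 k and mapping 'Varies with device' to None are assumptions made for the example.

```python
def clean_installs(value):
    """Turn an Installs string such as '10,000+' into an integer."""
    return int(value.replace(",", "").replace("+", ""))


def size_to_mb(value):
    """Turn a Size string such as '19M' or '512k' into megabytes.

    Assumption: 1 MB = 1024 k. Non-numeric entries such as
    'Varies with device' are returned as None.
    """
    if value.endswith("M"):
        return float(value[:-1])
    if value.endswith("k"):
        return float(value[:-1]) / 1024
    return None


print(clean_installs("10,000+"))         # 10000
print(size_to_mb("19M"))                 # 19.0
print(size_to_mb("512k"))                # 0.5
print(size_to_mb("Varies with device"))  # None
```

Helpers like these could be passed to Series.apply, in the same way convert_dollar is applied to the Price column later in this notebook.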

Removing the NaN values and duplicates present in the dataset

Handling the NaN values in the Play Store data

In [350]:
# This user-defined function reports, for each column, the dtype, the non-null count, the null count, the null ratio (as a fraction), and the unique count
def playstoreinfo():
    temp=pd.DataFrame(index=ps_df.columns)
    temp["datatype"]=ps_df.dtypes
    temp["not null values"]=ps_df.count()
    temp["null value"]=ps_df.isnull().sum()
    temp["% of the null value"]=ps_df.isnull().mean()
    temp["unique count"]=ps_df.nunique()
    return temp
playstoreinfo()
Out[350]:
datatype not null values null value % of the null value unique count
App object 10841 0 0.000000 9660
Category object 10841 0 0.000000 34
Rating float64 9367 1474 0.135965 40
Reviews object 10841 0 0.000000 6002
Size object 10841 0 0.000000 462
Installs object 10841 0 0.000000 22
Type object 10840 1 0.000092 3
Price object 10841 0 0.000000 93
Content Rating object 10840 1 0.000092 6
Genres object 10841 0 0.000000 120
Last Updated object 10841 0 0.000000 1378
Current Ver object 10833 8 0.000738 2832
Android Ver object 10838 3 0.000277 33

Findings

The null values per column are:

  • Rating has 1474 null values, which is 13.60% of the data.
  • Type has 1 null value, which is 0.01% of the data.
  • Content Rating has 1 null value, which is 0.01% of the data.
  • Current Ver has 8 null values, which is 0.07% of the data.
  • Android Ver has 3 null values, which is 0.03% of the data.
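
The percentages above follow directly from the null counts and the 10841 rows reported by ps_df.info(). A quick arithmetic check in plain Python (the dictionary is just a transcription of the counts in the summary table above; the variable names are illustrative):

```python
total_rows = 10841  # from ps_df.shape

# Null counts copied from the summary table above
null_counts = {
    "Rating": 1474,
    "Type": 1,
    "Content Rating": 1,
    "Current Ver": 8,
    "Android Ver": 3,
}

# Share of missing values per column, as a percentage rounded to 2 decimals
null_percent = {
    column: round(100 * count / total_rows, 2)
    for column, count in null_counts.items()
}

for column, percent in null_percent.items():
    print(f"{column}: {percent}%")
```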

Let's first deal with the columns that contain fewer NaN values. For each of them, we must either find a sensible way to replace the NaN values or find the reason the values are missing.

1). Android Ver: There are a total of 3 NaN values in this column.

In [351]:
# The rows containing NaN values in the Android Ver column
ps_df[ps_df["Android Ver"].isnull()]
Out[351]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
4453 [substratum] Vacuum: P PERSONALIZATION 4.4 230 11M 1,000+ Paid $1.49 Everyone Personalization July 20, 2018 4.4 NaN
4490 Pi Dark [substratum] PERSONALIZATION 4.5 189 2.1M 10,000+ Free 0 Everyone Personalization March 27, 2018 1.1 NaN
10472 Life Made WI-Fi Touchscreen Photo Frame 1.9 19.0 3.0M 1,000+ Free 0 Everyone NaN February 11, 2018 1.0.19 4.0 and up NaN
In [352]:
# Finding the different values the 'Android Ver' column takes
ps_df["Android Ver"].value_counts()
Out[352]:
4.1 and up            2451
4.0.3 and up          1501
4.0 and up            1375
Varies with device    1362
4.4 and up             980
2.3 and up             652
5.0 and up             601
4.2 and up             394
2.3.3 and up           281
2.2 and up             244
4.3 and up             243
3.0 and up             241
2.1 and up             134
1.6 and up             116
6.0 and up              60
7.0 and up              42
3.2 and up              36
2.0 and up              32
5.1 and up              24
1.5 and up              20
4.4W and up             12
3.1 and up              10
2.0.1 and up             7
8.0 and up               6
7.1 and up               3
4.0.3 - 7.1.1            2
5.0 - 8.0                2
1.0 and up               2
7.0 - 7.1.1              1
4.1 - 7.1.1              1
5.0 - 6.0                1
2.2 - 7.1.1              1
5.0 - 7.1.1              1
Name: Android Ver, dtype: int64

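The NaN values in the Android Ver column cannot be replaced by any particular value, and only 3 rows contain them, which amounts to less than 0.03% of the rows in the dataset, so these rows can be dropped. Note that one of them (index 10472) also has its values shifted one column to the left: its Category reads 1.9 and its Rating reads 19.0, which is the out-of-range maximum seen in describe(). Dropping it removes that bad record as well.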
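
Dropping the rows with a missing value in one column can be written either with a boolean mask or with DataFrame.dropna. A minimal sketch on a made-up three-row frame (the data and the name toy_df are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for ps_df; the values are invented for illustration
toy_df = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Android Ver": ["4.1 and up", np.nan, "5.0 and up"],
})

# Option 1: keep only the rows where the column is not NaN
masked = toy_df[toy_df["Android Ver"].notna()]

# Option 2: dropna restricted to that one column
dropped = toy_df.dropna(subset=["Android Ver"])

print(masked.equals(dropped))  # both approaches keep the same rows
print(dropped.shape)
```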

In [353]:
ps_df.shape
Out[353]:
(10841, 13)
In [354]:
# Dropping the rows that contain NaN values in the 'Android Ver' column.
ps_df =ps_df[ps_df['Android Ver'].notna()]
# Shape of the updated dataframe
ps_df.shape
Out[354]:
(10838, 13)

We were able to handle the NaN values in the Android Ver column.

2). Current Ver: There are a total of 8 NaN values in this column.

In [355]:
# The rows containing NaN values in the Current Ver column
ps_df[ps_df["Current Ver"].isnull()]
Out[355]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
15 Learn To Draw Kawaii Characters ART_AND_DESIGN 3.2 55 2.7M 5,000+ Free 0 Everyone Art & Design June 6, 2018 NaN 4.2 and up
1553 Market Update Helper LIBRARIES_AND_DEMO 4.1 20145 11k 1,000,000+ Free 0 Everyone Libraries & Demo February 12, 2013 NaN 1.5 and up
6322 Virtual DJ Sound Mixer TOOLS 4.2 4010 8.7M 500,000+ Free 0 Everyone Tools May 10, 2017 NaN 4.0 and up
6803 BT Master FAMILY NaN 0 222k 100+ Free 0 Everyone Education November 6, 2016 NaN 1.6 and up
7333 Dots puzzle FAMILY 4.0 179 14M 50,000+ Paid $0.99 Everyone Puzzle April 18, 2018 NaN 4.0 and up
7407 Calculate My IQ FAMILY NaN 44 7.2M 10,000+ Free 0 Everyone Entertainment April 3, 2017 NaN 2.3 and up
7730 UFO-CQ TOOLS NaN 1 237k 10+ Paid $0.99 Everyone Tools July 4, 2016 NaN 2.0 and up
10342 La Fe de Jesus BOOKS_AND_REFERENCE NaN 8 658k 1,000+ Free 0 Everyone Books & Reference January 31, 2017 NaN 3.0 and up
In [356]:
# Finding the different values the 'Current Ver' column takes
ps_df['Current Ver'].value_counts()
Out[356]:
Varies with device    1459
1.0                    809
1.1                    263
1.2                    178
2.0                    151
                      ... 
5.44.1                   1
7.16.8                   1
04.08.00                 1
2.10.06                  1
2.0.148.0                1
Name: Current Ver, Length: 2831, dtype: int64

Only 8 rows contain NaN values in the Current Ver column, which amounts to around 0.07% of the rows in the dataset, and there is no particular value with which we can replace them, so these rows can be dropped.

In [357]:
# dropping rows corresponding to the values which contain NaN in the column 'Current Ver'.
ps_df=ps_df[ps_df["Current Ver"].notna()]
# Shape of the updated dataframe
ps_df.shape
Out[357]:
(10830, 13)

3). Type: There is only one NaN value in this column.

In [358]:
# The row containing NaN values in the Type column
ps_df[ps_df["Type"].isnull()]
Out[358]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
9148 Command & Conquer: Rivals FAMILY NaN 0 Varies with device 0 NaN 0 Everyone 10+ Strategy June 28, 2018 Varies with device Varies with device
In [359]:
# Finding the different values the 'Type' column takes
ps_df["Type"].value_counts()
Out[359]:
Free    10032
Paid      797
Name: Type, dtype: int64

The Type column contains only two values, Free and Paid. If an app is paid, its price appears in the corresponding Price column; otherwise the price is shown as '0'. In this case the price of the app is '0', which means the app is free. Hence we can replace this NaN value with Free.
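
The rule used here (a price of 0 means the app is free) can be sketched as a small helper. The name infer_type is hypothetical, and the helper assumes the price is still the raw string from the dataset, such as '0' or '$1.49':

```python
def infer_type(price):
    """Return 'Free' when the raw price string is zero, otherwise 'Paid'."""
    amount = float(price.replace("$", ""))
    return "Free" if amount == 0 else "Paid"


print(infer_type("0"))      # Free
print(infer_type("$1.49"))  # Paid
```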

In [360]:
# Replacing the NaN value in 'Type' column corresponding to row index 9148 with 'Free'
ps_df.loc[9148,'Type']='Free'
In [361]:
ps_df[ps_df['Type'].isnull()]
Out[361]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver

4). Rating: This column contains 1470 NaN values.

In [362]:
# The rows containing NaN values in the Rating column
ps_df[ps_df['Rating'].isnull()]
Out[362]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
23 Mcqueen Coloring pages ART_AND_DESIGN NaN 61 7.0M 100,000+ Free 0 Everyone Art & Design;Action & Adventure March 7, 2018 1.0.0 4.1 and up
113 Wrinkles and rejuvenation BEAUTY NaN 182 5.7M 100,000+ Free 0 Everyone 10+ Beauty September 20, 2017 8.0 3.0 and up
123 Manicure - nail design BEAUTY NaN 119 3.7M 50,000+ Free 0 Everyone Beauty July 23, 2018 1.3 4.1 and up
126 Skin Care and Natural Beauty BEAUTY NaN 654 7.4M 100,000+ Free 0 Teen Beauty July 17, 2018 1.15 4.1 and up
129 Secrets of beauty, youth and health BEAUTY NaN 77 2.9M 10,000+ Free 0 Mature 17+ Beauty August 8, 2017 2.0 2.3 and up
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10824 Cardio-FR MEDICAL NaN 67 82M 10,000+ Free 0 Everyone Medical July 31, 2018 2.2.2 4.4 and up
10825 Naruto & Boruto FR SOCIAL NaN 7 7.7M 100+ Free 0 Teen Social February 2, 2018 1.0 4.0 and up
10831 payermonstationnement.fr MAPS_AND_NAVIGATION NaN 38 9.8M 5,000+ Free 0 Everyone Maps & Navigation June 13, 2018 2.0.148.0 4.0 and up
10835 FR Forms BUSINESS NaN 0 9.6M 10+ Free 0 Everyone Business September 29, 2016 1.1.5 4.0 and up
10838 Parkinson Exercices FR MEDICAL NaN 3 9.5M 1,000+ Free 0 Everyone Medical January 20, 2017 1.0 2.2 and up

1470 rows × 13 columns

In [363]:
ps_df['Rating'].max()
Out[363]:
5.0

We also know that the rating of any app in the Play Store lies between 1 and 5. Let's check whether there are any ratings outside this range.

In [364]:
ps_df[(ps_df['Rating'] <1) | (ps_df['Rating']>5)]
Out[364]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
  • The Rating column contains 1470 NaN values, which account for approximately 13.6% of the rows in the dataset. It is not practical to drop these rows because doing so would lose a large amount of data, which may impact the final quality of the analysis.
  • The NaN values in this case can be imputed with an aggregate (mean or median) of the remaining values in the Rating column.
In [365]:
ps_df['Rating'].mean()
Out[365]:
4.191837606837612
In [366]:
ps_df['Rating'].median()
Out[366]:
4.3
In [367]:
ps_df
Out[367]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0 Teen Art & Design June 8, 2018 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8M 100,000+ Free 0 Everyone Art & Design;Creativity June 20, 2018 1.1 4.4 and up
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10836 Sya9a Maroc - FR FAMILY 4.5 38 53M 5,000+ Free 0 Everyone Education July 25, 2017 1.48 4.1 and up
10837 Fr. Mike Schmitz Audio Teachings FAMILY 5.0 4 3.6M 100+ Free 0 Everyone Education July 6, 2018 1.0 4.1 and up
10838 Parkinson Exercices FR MEDICAL NaN 3 9.5M 1,000+ Free 0 Everyone Medical January 20, 2017 1.0 2.2 and up
10839 The SCP Foundation DB fr nn5n BOOKS_AND_REFERENCE 4.5 114 Varies with device 1,000+ Free 0 Mature 17+ Books & Reference January 19, 2015 Varies with device Varies with device
10840 iHoroscope - 2018 Daily Horoscope & Astrology LIFESTYLE 4.5 398307 19M 10,000,000+ Free 0 Everyone Lifestyle July 25, 2018 Varies with device Varies with device

10830 rows × 13 columns

Visualizing the distribution of ratings with a distplot and detecting the outliers with a boxplot.

In [368]:
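fig, ax = plt.subplots(2,1, figsize=(12,7))
# Top panel: histogram with density curve. Note: sns.distplot is deprecated in recent seaborn releases (sns.histplot with kde=True is the suggested replacement).
sns.distplot(ps_df['Rating'],color='firebrick',ax=ax[0]);
# Bottom panel: boxplot of the ratings to spot outliers
sns.boxplot(x='Rating',data=ps_df, ax=ax[1]);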
  • The mean of the average ratings (excluding the NaN values) is about 4.2.

  • The median of the entries (excluding the NaN values) in the 'Rating' column is 4.3. From this we can say that about half of the apps have an average rating of 4.3 or above, and the rest 4.3 or below.

  • From the distplot visualization, it is clear that the ratings are left skewed.
  • When a variable is skewed, the mean is pulled toward the values at the far end of the distribution. Therefore, the median is a better representation of the majority of the values in the variable.
  • Hence we will impute the NaN values in the Rating column with its median.
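
To see why the median is the safer choice for a left-skewed variable, consider the small made-up sample below (the numbers are illustrative, not drawn from the dataset). A single very low rating pulls the mean down, while the median stays with the bulk of the values:

```python
from statistics import mean, median

# Invented ratings with one low outlier and one missing value (None)
ratings = [4.5, 4.3, 4.4, 4.6, 1.0, 4.2, None, 4.7]

observed = [r for r in ratings if r is not None]
sample_mean = mean(observed)
sample_median = median(observed)

# Impute the missing entry with the median of the observed values
imputed = [sample_median if r is None else r for r in ratings]

print(round(sample_mean, 2))  # pulled down by the 1.0 outlier
print(sample_median)          # stays near the typical rating
print(imputed)
```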
In [369]:
ps_df['Rating'].median()
Out[369]:
4.3
In [370]:
# Replacing the NaN values in the 'Rating' column with its median value
# (assigning the result back avoids relying on inplace=True on a column selection)
ps_df['Rating'] = ps_df['Rating'].fillna(ps_df['Rating'].median())
In [371]:
ps_df['Rating'].value_counts()
Out[371]:
4.3    2546
4.4    1108
4.5    1037
4.2     951
4.6     823
4.1     707
4.0     567
4.7     499
3.9     386
3.8     303
5.0     274
3.7     239
4.8     234
3.6     174
3.5     163
3.4     128
3.3     102
4.9      87
3.0      83
3.1      69
3.2      63
2.9      45
2.8      42
2.7      25
2.6      25
2.5      21
2.3      20
2.4      19
1.0      16
2.2      14
1.9      13
2.0      12
1.7       8
1.8       8
2.1       8
1.6       4
1.4       3
1.5       3
1.2       1
Name: Rating, dtype: int64
In [372]:
ps_df['Rating'].isna().sum()
Out[372]:
0

Handling duplicate values and manipulating the dataset:

1). Handling the duplicates in the App column

In [373]:
# Checking the column types and non-null counts after handling the NaN values
ps_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10830 entries, 0 to 10840
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10830 non-null  object 
 1   Category        10830 non-null  object 
 2   Rating          10830 non-null  float64
 3   Reviews         10830 non-null  object 
 4   Size            10830 non-null  object 
 5   Installs        10830 non-null  object 
 6   Type            10830 non-null  object 
 7   Price           10830 non-null  object 
 8   Content Rating  10830 non-null  object 
 9   Genres          10830 non-null  object 
 10  Last Updated    10830 non-null  object 
 11  Current Ver     10830 non-null  object 
 12  Android Ver     10830 non-null  object 
dtypes: float64(1), object(12)
memory usage: 1.4+ MB
In [374]:
ps_df['App'].value_counts()
Out[374]:
ROBLOX                                                9
CBS Sports App - Scores, News, Stats & Watch Live     8
Candy Crush Saga                                      7
8 Ball Pool                                           7
ESPN                                                  7
                                                     ..
Meet U - Get Friends for Snapchat, Kik & Instagram    1
U-Report                                              1
U of I Community Credit Union                         1
Waiting For U Launcher Theme                          1
iHoroscope - 2018 Daily Horoscope & Astrology         1
Name: App, Length: 9649, dtype: int64
In [375]:
# Inspecting the duplicates values.
ps_df[ps_df['App']=='ROBLOX']
Out[375]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
1653 ROBLOX GAME 4.5 4447388 67M 100,000,000+ Free 0 Everyone 10+ Adventure;Action & Adventure July 31, 2018 2.347.225742 4.1 and up
1701 ROBLOX GAME 4.5 4447346 67M 100,000,000+ Free 0 Everyone 10+ Adventure;Action & Adventure July 31, 2018 2.347.225742 4.1 and up
1748 ROBLOX GAME 4.5 4448791 67M 100,000,000+ Free 0 Everyone 10+ Adventure;Action & Adventure July 31, 2018 2.347.225742 4.1 and up
1841 ROBLOX GAME 4.5 4449882 67M 100,000,000+ Free 0 Everyone 10+ Adventure;Action & Adventure July 31, 2018 2.347.225742 4.1 and up
1870 ROBLOX GAME 4.5 4449910 67M 100,000,000+ Free 0 Everyone 10+ Adventure;Action & Adventure July 31, 2018 2.347.225742 4.1 and up
2016 ROBLOX FAMILY 4.5 4449910 67M 100,000,000+ Free 0 Everyone 10+ Adventure;Action & Adventure July 31, 2018 2.347.225742 4.1 and up
2088 ROBLOX FAMILY 4.5 4450855 67M 100,000,000+ Free 0 Everyone 10+ Adventure;Action & Adventure July 31, 2018 2.347.225742 4.1 and up
2206 ROBLOX FAMILY 4.5 4450890 67M 100,000,000+ Free 0 Everyone 10+ Adventure;Action & Adventure July 31, 2018 2.347.225742 4.1 and up
4527 ROBLOX FAMILY 4.5 4443407 67M 100,000,000+ Free 0 Everyone 10+ Adventure;Action & Adventure July 31, 2018 2.347.225742 4.1 and up
In [376]:
ps_df[ps_df['App'].duplicated()]
Out[376]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
229 Quick PDF Scanner + OCR FREE BUSINESS 4.2 80805 Varies with device 5,000,000+ Free 0 Everyone Business February 26, 2018 Varies with device 4.0.3 and up
236 Box BUSINESS 4.2 159872 Varies with device 10,000,000+ Free 0 Everyone Business July 31, 2018 Varies with device Varies with device
239 Google My Business BUSINESS 4.4 70991 Varies with device 5,000,000+ Free 0 Everyone Business July 24, 2018 2.19.0.204537701 4.4 and up
256 ZOOM Cloud Meetings BUSINESS 4.4 31614 37M 10,000,000+ Free 0 Everyone Business July 20, 2018 4.1.28165.0716 4.0 and up
261 join.me - Simple Meetings BUSINESS 4.0 6989 Varies with device 1,000,000+ Free 0 Everyone Business July 16, 2018 4.3.0.508 4.4 and up
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10715 FarmersOnly Dating DATING 3.0 1145 1.4M 100,000+ Free 0 Mature 17+ Dating February 25, 2016 2.2 4.0 and up
10720 Firefox Focus: The privacy browser COMMUNICATION 4.4 36981 4.0M 1,000,000+ Free 0 Everyone Communication July 6, 2018 5.2 5.0 and up
10730 FP Notebook MEDICAL 4.5 410 60M 50,000+ Free 0 Everyone Medical March 24, 2018 2.1.0.372 4.4 and up
10753 Slickdeals: Coupons & Shopping SHOPPING 4.5 33599 12M 1,000,000+ Free 0 Everyone Shopping July 30, 2018 3.9 4.4 and up
10768 AAFP MEDICAL 3.8 63 24M 10,000+ Free 0 Everyone Medical June 22, 2018 2.3.1 5.0 and up

1181 rows × 13 columns

In [377]:
# dropping duplicates from the 'App' column.
ps_df.drop_duplicates(subset = 'App', inplace = True)
ps_df.shape
Out[377]:
(9649, 13)
In [378]:
# Checking whether the duplicates in the 'App' column are taken care of or not
ps_df[ps_df['App']=='ROBLOX']
Out[378]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
1653 ROBLOX GAME 4.5 4447388 67M 100,000,000+ Free 0 Everyone 10+ Adventure;Action & Adventure July 31, 2018 2.347.225742 4.1 and up
In [379]:
ps_df[ps_df['App'].duplicated()]
Out[379]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver

We have handled all the duplicate values in the App column. After dropping the duplicate rows, 9649 rows remain.
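
Note that drop_duplicates(subset='App') keeps the first occurrence of each app name, so which copy survives depends on row order (for ROBLOX, the row at index 1653). If one instead wanted to keep the copy with the most reviews, a possible variation is sketched below on a made-up frame. The data is invented, and the Reviews column is assumed to be numeric already, which is not yet the case for ps_df at this point.

```python
import pandas as pd

# Toy frame with a duplicated app name; values are invented for illustration
toy_df = pd.DataFrame({
    "App": ["ROBLOX", "ROBLOX", "Other App"],
    "Reviews": [100, 250, 40],
})

# Default behaviour: the first occurrence of each app is kept
keep_first = toy_df.drop_duplicates(subset="App", keep="first")

# Variation: sort by Reviews so that the most-reviewed copy comes first
keep_most_reviewed = (
    toy_df.sort_values("Reviews", ascending=False)
          .drop_duplicates(subset="App", keep="first")
          .sort_index()
)

print(keep_first)
print(keep_most_reviewed)
```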

2). Changing the datatype of the Last Updated column from string to datetime.

In [380]:
ps_df['Last Updated']
Out[380]:
0         January 7, 2018
1        January 15, 2018
2          August 1, 2018
3            June 8, 2018
4           June 20, 2018
               ...       
10836       July 25, 2017
10837        July 6, 2018
10838    January 20, 2017
10839    January 19, 2015
10840       July 25, 2018
Name: Last Updated, Length: 9649, dtype: object
In [381]:
# pd.to_datetime() converts the date strings in the 'Last Updated' column to datetime64 values.
ps_df["Last Updated"] = pd.to_datetime(ps_df['Last Updated'])
ps_df.head()
Out[381]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0 Everyone Art & Design 2018-01-07 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0 Everyone Art & Design;Pretend Play 2018-01-15 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0 Everyone Art & Design 2018-08-01 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0 Teen Art & Design 2018-06-08 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8M 100,000+ Free 0 Everyone Art & Design;Creativity 2018-06-20 1.1 4.4 and up
In [382]:
ps_df['Last Updated']
Out[382]:
0       2018-01-07
1       2018-01-15
2       2018-08-01
3       2018-06-08
4       2018-06-20
           ...    
10836   2017-07-25
10837   2018-07-06
10838   2017-01-20
10839   2015-01-19
10840   2018-07-25
Name: Last Updated, Length: 9649, dtype: datetime64[ns]
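
The Last Updated strings follow a "Month day, year" pattern, which corresponds to the format string '%B %d, %Y'. The sketch below parses one value with the standard library to make the pattern explicit; it assumes English month names (the default C/English locale). The same format string can be passed to pd.to_datetime through its format parameter if one prefers not to rely on format inference.

```python
from datetime import datetime

# Format of the raw 'Last Updated' values, e.g. 'January 7, 2018'
DATE_FORMAT = "%B %d, %Y"

parsed = datetime.strptime("January 7, 2018", DATE_FORMAT)
print(parsed.date())  # 2018-01-07
```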

3). Changing the datatype of the Price column from string to float.

In [383]:
ps_df['Price'].value_counts()
Out[383]:
0          8896
$0.99       143
$2.99       124
$1.99        73
$4.99        70
           ... 
$18.99        1
$389.99       1
$19.90        1
$1.75         1
$1.04         1
Name: Price, Length: 92, dtype: int64

To convert this column from string to float, we must first drop the $ symbol from all the values. Then we can convert those values to the float datatype.

We apply the convert_dollar function to convert the values in the Price column from string datatype to float datatype.

In [384]:
s = '$1.23'
In [385]:
if "$" in s:
    print(s[1:])
1.23
In [386]:
# Creating a function convert_dollar which drops the $ symbol if it is present and returns the value as a float.
def convert_dollar(val):
    if '$' in val:
        return float(val[1:])
    else:
        return float(val)
In [387]:
# The convert_dollar function applied to the Price column
ps_df['Price'] = ps_df['Price'].apply(convert_dollar)
In [388]:
ps_df['Price'].max()
Out[388]:
400.0
In [389]:
ps_df['Price'].dtype
Out[389]:
dtype('float64')
In [390]:
ps_df[ps_df['Price']==0]
Out[390]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0.0 Everyone Art & Design 2018-01-07 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0.0 Everyone Art & Design;Pretend Play 2018-01-15 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0.0 Everyone Art & Design 2018-08-01 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50,000,000+ Free 0.0 Teen Art & Design 2018-06-08 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8M 100,000+ Free 0.0 Everyone Art & Design;Creativity 2018-06-20 1.1 4.4 and up
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10836 Sya9a Maroc - FR FAMILY 4.5 38 53M 5,000+ Free 0.0 Everyone Education 2017-07-25 1.48 4.1 and up
10837 Fr. Mike Schmitz Audio Teachings FAMILY 5.0 4 3.6M 100+ Free 0.0 Everyone Education 2018-07-06 1.0 4.1 and up
10838 Parkinson Exercices FR MEDICAL 4.3 3 9.5M 1,000+ Free 0.0 Everyone Medical 2017-01-20 1.0 2.2 and up
10839 The SCP Foundation DB fr nn5n BOOKS_AND_REFERENCE 4.5 114 Varies with device 1,000+ Free 0.0 Mature 17+ Books & Reference 2015-01-19 Varies with device Varies with device
10840 iHoroscope - 2018 Daily Horoscope & Astrology LIFESTYLE 4.5 398307 19M 10,000,000+ Free 0.0 Everyone Lifestyle 2018-07-25 Varies with device Varies with device

8896 rows × 13 columns

In [391]:
ps_df[ps_df['Price']!=0]
Out[391]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
234 TurboScan: scan documents and receipts in PDF BUSINESS 4.7 11442 6.8M 100,000+ Paid 4.99 Everyone Business 2018-03-25 1.5.2 4.0 and up
235 Tiny Scanner Pro: PDF Doc Scan BUSINESS 4.8 10295 39M 100,000+ Paid 4.99 Everyone Business 2017-04-11 3.4.6 3.0 and up
427 Puffin Browser Pro COMMUNICATION 4.0 18247 Varies with device 100,000+ Paid 3.99 Everyone Communication 2018-07-05 7.5.3.20547 4.1 and up
476 Moco+ - Chat, Meet People DATING 4.2 1545 Varies with device 10,000+ Paid 3.99 Mature 17+ Dating 2018-06-19 2.6.139 4.1 and up
477 Calculator DATING 2.6 57 6.2M 1,000+ Paid 6.99 Everyone Dating 2017-10-25 1.1.6 4.0 and up
... ... ... ... ... ... ... ... ... ... ... ... ... ...
10735 FP VoiceBot FAMILY 4.3 17 157k 100+ Paid 0.99 Mature 17+ Entertainment 2015-11-25 1.2 2.1 and up
10760 Fast Tract Diet HEALTH_AND_FITNESS 4.4 35 2.4M 1,000+ Paid 7.99 Everyone Health & Fitness 2018-08-08 1.9.3 4.2 and up
10782 Trine 2: Complete Story GAME 3.8 252 11M 10,000+ Paid 16.99 Teen Action 2015-02-27 2.22 5.0 and up
10785 sugar, sugar FAMILY 4.2 1405 9.5M 10,000+ Paid 1.20 Everyone Puzzle 2018-06-05 2.7 2.3 and up
10798 Word Search Tab 1 FR FAMILY 4.3 0 1020k 50+ Paid 1.04 Everyone Puzzle 2012-02-06 1.1 3.0 and up

753 rows × 13 columns

We have successfully converted the datatype of values in the Price column from string to float.
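
As a hedged alternative to the row-wise apply, the same conversion can be sketched with vectorized string methods. The sample values are taken from the value_counts() output above; the names `raw_prices` and `clean_prices` are illustrative and not part of the notebook.

```python
import pandas as pd

# Sample of raw price strings as they appear in the Price column
raw_prices = pd.Series(['0', '$0.99', '$2.99', '$389.99'])

# Remove the literal '$' (regex=False treats it as plain text, not a regex anchor),
# then cast the remaining text to float
clean_prices = raw_prices.str.replace('$', '', regex=False).astype(float)

print(clean_prices.tolist())
```

On a column of this size both approaches are fast; the vectorized form is mainly shorter and avoids a helper function.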

4). Converting the values in the Installs column from string datatype to integer datatype.¶

In [392]:
s1 = "1,000,00+"
In [393]:
s1
Out[393]:
'1,000,00+'
In [394]:
s1 = s1.replace(",","")
In [395]:
s1
Out[395]:
'100000+'
In [396]:
s1[ 0: -1]
Out[396]:
'100000'
In [397]:
# Checking the contents of the 'Installs' column
ps_df['Installs'].value_counts()
Out[397]:
1,000,000+        1416
100,000+          1112
10,000+           1029
10,000,000+        937
1,000+             886
100+               709
5,000,000+         607
500,000+           504
50,000+            468
5,000+             467
10+                384
500+               328
50+                204
50,000,000+        202
100,000,000+       188
5+                  82
1+                  67
500,000,000+        24
1,000,000,000+      20
0+                  14
0                    1
Name: Installs, dtype: int64

To convert all the values in the Installs column from string datatype to integer datatype, we must first drop the '+' symbol (and the thousands separators) from all the entries where present, and then we can change the datatype.

Applying the convert_plus function to convert the values in the Installs column from string datatype to integer datatype.

In [398]:
# Creating a function convert_plus which drops the '+' symbol if it is present and returns the output which is of integer datatype.

def convert_plus(val):
    # Both membership tests must be written out: `'+' and ',' in val` would only check for ','
    if '+' in val and ',' in val:
        new = int(val[:-1].replace(',',''))
        return new
    elif '+' in val:
        new1 = int(val[:-1])
        return new1
    else:
        return int(val)
In [399]:
# The convert_plus function applied to the main dataframe

ps_df['Installs'] = ps_df['Installs'].apply(lambda x: convert_plus(x))
ps_df.head()
Out[399]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10000 Free 0.0 Everyone Art & Design 2018-01-07 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500000 Free 0.0 Everyone Art & Design;Pretend Play 2018-01-15 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5000000 Free 0.0 Everyone Art & Design 2018-08-01 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25M 50000000 Free 0.0 Teen Art & Design 2018-06-08 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8M 100000 Free 0.0 Everyone Art & Design;Creativity 2018-06-20 1.1 4.4 and up
In [400]:
ps_df['Installs']
Out[400]:
0           10000
1          500000
2         5000000
3        50000000
4          100000
           ...   
10836        5000
10837         100
10838        1000
10839        1000
10840    10000000
Name: Installs, Length: 9649, dtype: int64
In [401]:
ps_df['Installs'].dtype
Out[401]:
dtype('int64')

The resultant values in the Installs column are of the integer datatype, and each represents the minimum number of times a particular app has been installed.

  • Installs = 0 indicates that the app has not been installed by anyone yet.
  • Installs = 1 indicates that the app has been installed by at least one user.
  • Installs = 1000000 indicates that the app has been installed by at least one million users, and so on.
  • We have successfully converted the datatype of values in the Installs column from string to int.
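
The cleaning rule described above (remove the thousands separators and the trailing '+', then cast to int) can also be sketched as a single expression. The helper name `parse_installs` is hypothetical; it is an illustration, not the notebook's own function.

```python
def parse_installs(val: str) -> int:
    """Turn strings such as '1,000,000+' into the integer lower bound 1000000."""
    # Removing ',' and a trailing '+' unconditionally avoids chained membership
    # tests. Note that an expression like `'+' and ',' in val` only tests for
    # ',', because the non-empty string '+' is always truthy.
    return int(val.replace(',', '').rstrip('+'))


samples = ['1,000,000+', '500+', '0+', '0']
print([parse_installs(s) for s in samples])
```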

5). Converting the values in the Size column to the same unit of measure (MB).¶

In [402]:
ps_df['Size'].value_counts()
Out[402]:
Varies with device    1227
12M                    181
11M                    181
13M                    177
14M                    176
                      ... 
721k                     1
430k                     1
429k                     1
200k                     1
619k                     1
Name: Size, Length: 457, dtype: int64

We can see that the values in the Size column use different units: 'M' stands for MB and 'k' stands for KB. To analyse this column easily, all the values must be converted to a single unit. In this case, we will convert everything to MB.

We know that 1 MB = 1024 KB, so to convert KB to MB we divide the values that are in KB by 1024.
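
As a worked check of this conversion, here are three sizes that appear in the value counts above (the helper name `kb_to_mb` is illustrative):

```python
KB_PER_MB = 1024  # binary convention used in this notebook


def kb_to_mb(kilobytes: float) -> float:
    """Convert kilobytes to megabytes, rounded to 4 decimal places."""
    return round(kilobytes / KB_PER_MB, 4)


for kb in (721, 430, 200):
    print(f"{kb}k = {kb_to_mb(kb)} MB")
```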

In [403]:
# Defining a function to convert all the entries in KB to MB and then converting them to float datatype.

def convert_kb_to_mb(val):
    try:
        if 'M' in val:
            return float(val[:-1])
        elif 'k' in val:
            return round(float(val[:-1])/1024, 4)
        else:
            return val
    except (TypeError, ValueError):
        # Non-string or unparsable entries are returned unchanged
        return val

Applying the convert_kb_to_mb function to convert the values in the Size column to a single unit of measure (MB). The datatype becomes float once the remaining Varies with device entries have been replaced with NaN.

In [404]:
# The convert_kb_to_mb function applied to the Size column

ps_df['Size'] = ps_df['Size'].apply(lambda x: convert_kb_to_mb(x))
ps_df.head()
Out[404]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19.0 10000 Free 0.0 Everyone Art & Design 2018-01-07 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14.0 500000 Free 0.0 Everyone Art & Design;Pretend Play 2018-01-15 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7 5000000 Free 0.0 Everyone Art & Design 2018-08-01 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25.0 50000000 Free 0.0 Teen Art & Design 2018-06-08 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8 100000 Free 0.0 Everyone Art & Design;Creativity 2018-06-20 1.1 4.4 and up
In [405]:
ps_df['Size'].dtype
Out[405]:
dtype('O')
In [406]:
ps_df['Size'] = ps_df['Size'].apply(lambda x: str(x).replace('Varies with device', 'NaN') if 'Varies with device' in str(x) else x)
In [407]:
ps_df['Size'].value_counts()
Out[407]:
NaN       1227
12.0       181
11.0       181
13.0       177
14.0       176
          ... 
0.7041       1
0.4199       1
0.4189       1
0.1953       1
0.6045       1
Name: Size, Length: 456, dtype: int64
In [408]:
ps_df['Size'] = ps_df['Size'].apply(lambda x: float(x))
In [409]:
ps_df['Size'].dtype
Out[409]:
dtype('float64')
In [410]:
ps_df['Size'].mean(), ps_df['Size'].median()
Out[410]:
(20.413555699358753, 12.0)
In [411]:
ps_df['Size'].max()
Out[411]:
100.0
In [412]:
 ps_df[ps_df['Size'] != 'Varies with device']['Size'].max()
Out[412]:
100.0
In [413]:
round(ps_df['Size'].mean(),4)
Out[413]:
20.4136

A substantial number of entries in the Size column (1227) contained the value Varies with device, which has now been converted to NaN. Since these entries cannot be used for analysis, let's see whether they can be imputed with the mean or median value of the column.

In [414]:
# Finding max, min, mean, and median in the Size column excluding the 'Varies with device' values.

max_size = ps_df['Size'].max()

min_size = ps_df['Size'].min()

mean_size = round(ps_df['Size'].mean(),4)

median_size = ps_df['Size'].median()

[max_size, min_size, mean_size, median_size]
Out[414]:
[100.0, 0.0083, 20.4136, 12.0]

Visualization of the distribution of `Size` using a distplot, and detection of outliers through a boxplot.

In [415]:
ps_df[['Size']].boxplot()
Out[415]:
<AxesSubplot:>
In [416]:
# Distplot
fig, ax = plt.subplots(2,1, figsize=(12,7))
sns.distplot(ps_df[ps_df['Size'] != 'Varies with device']['Size'], color='purple', ax=ax[0])
sns.boxplot(x='Size',data=ps_df, ax=ax[1])
Out[416]:
<AxesSubplot:xlabel='Size'>
  • It is clear from the visualizations that the data in the Size column is skewed towards the right.
  • Also, 1227 entries in this column (about 13%) originally had the value Varies with device. Replacing that many values with a central tendency value (mean or median) may distort visualizations and results, which should be kept in mind for the median imputation applied in the next cell.

  • We have successfully converted all the valid entries in the Size column to a single unit of measure (MB) and the datatype from string to float.
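
To illustrate why imputing many missing values with the median can distort a distribution, here is a small sketch on made-up numbers (the `sizes` list is hypothetical, not drawn from the dataset). Filling the gaps with a single value leaves the median unchanged but shrinks the spread.

```python
from statistics import median, pstdev

# Hypothetical right-skewed app sizes in MB; None marks a missing value
sizes = [2.0, 4.0, 6.0, 12.0, 25.0, 60.0, 100.0, None, None, None]

known = [s for s in sizes if s is not None]
fill = median(known)  # median of the observed values
imputed = [fill if s is None else s for s in sizes]

print("median before/after:", median(known), median(imputed))
print("std dev before/after:", round(pstdev(known), 2), round(pstdev(imputed), 2))
```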

In [417]:
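# Fill the missing sizes with the median; assigning the result back avoids relying on in-place modification of a column selection
ps_df['Size'] = ps_df['Size'].fillna(ps_df['Size'].median())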

6). Converting the datatype of values in the Reviews column from string to int.¶

In [418]:
# Converting the datatype of the values in the reviews column from string to int
ps_df['Reviews'] = ps_df['Reviews'].astype(int)
ps_df.head()
Out[418]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19.0 10000 Free 0.0 Everyone Art & Design 2018-01-07 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14.0 500000 Free 0.0 Everyone Art & Design;Pretend Play 2018-01-15 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7 5000000 Free 0.0 Everyone Art & Design 2018-08-01 1.2.4 4.0.3 and up
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25.0 50000000 Free 0.0 Teen Art & Design 2018-06-08 Varies with device 4.2 and up
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8 100000 Free 0.0 Everyone Art & Design;Creativity 2018-06-20 1.1 4.4 and up
In [419]:
ps_df.describe()
Out[419]:
Rating Reviews Size Installs Price
count 9649.000000 9.649000e+03 9649.000000 9.649000e+03 9649.000000
mean 4.192476 2.168145e+05 19.343659 7.785404e+06 1.100079
std 0.496528 1.832255e+06 20.589648 5.378557e+07 16.860857
min 1.000000 0.000000e+00 0.008300 0.000000e+00 0.000000
25% 4.000000 2.500000e+01 5.300000 1.000000e+03 0.000000
50% 4.300000 9.690000e+02 12.000000 1.000000e+05 0.000000
75% 4.500000 2.944500e+04 25.000000 1.000000e+06 0.000000
max 5.000000 7.815831e+07 100.000000 1.000000e+09 400.000000

We have successfully converted the datatype of the values in the Reviews column from string to int.

Now that we have handled the errors and NaN values in the playstoredata.csv file, let's do the same for the userreviews.csv file.

Data Exploration--Univariate & Bivariate Analysis¶

A pair plot is used to find the set of features that best explains a relationship between two variables or that forms the most clearly separated clusters. It can also suggest simple classification models, by showing where straight lines would separate groups in the dataset.

Plot a pairwise plot between all the quantitative variables to look for any evident patterns or relationships between the features.

In [420]:
Rating = ps_df['Rating']
Size = ps_df['Size']
Installs = ps_df['Installs']
Reviews = ps_df['Reviews']
Type = ps_df['Type']
Price = ps_df['Price']

# log1p (log of 1 + x) keeps zero counts at 0 instead of the -inf a plain log would give
p = sns.pairplot(pd.DataFrame(list(zip(Rating, Size, np.log1p(Installs), np.log1p(Reviews), Price, Type)), 
                        columns=['Rating','Size', 'Installs', 'Reviews', 'Price','Type']), hue='Type')
p.fig.suptitle("Pairwise Plot - Rating, Size, Installs, Reviews, Price",x=0.5, y=1.0, fontsize=16)
Out[420]:
Text(0.5, 1.0, 'Pairwise Plot - Rating, Size, Installs, Reviews, Price')

FINDINGS

  • Most of the apps are free.
  • Most of the paid apps have a rating around 4.
  • As the number of installations increases, the number of reviews of the particular app also increases.
  • Most of the apps are lightweight.
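
The pair plot above log-transforms the heavily skewed Installs and Reviews columns. Because both columns contain zeros (see the `0` and `0+` entries in the Installs value counts), a plain logarithm is undefined for those rows. A minimal sketch of the difference, using only the standard library and made-up counts:

```python
import math

counts = [0, 9, 999, 999_999]

# log10(0) is undefined: math.log10 raises ValueError (NumPy would return -inf)
try:
    math.log10(counts[0])
except ValueError as err:
    print("log10(0) failed:", err)

# Adding 1 before taking the log keeps zero counts at exactly 0
shifted = [math.log10(c + 1) for c in counts]
print(shifted)
```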
In [421]:
def plot_number_category(data):
    sns.set(style="whitegrid")  # Set the style to whitegrid
    fig, ax = plt.subplots()
    fig.set_size_inches(15, 7)
    ax.set_facecolor('yellow')  # Set background color to yellow
    sns.countplot(x='Category', data=data, ax=ax, palette='pastel')  # Keyword arguments; uses Seaborn's pastel color palette
    plt.xticks(rotation=90)
    
    # Add count values on top of each bar
    for p in ax.patches:
        ax.annotate(format(p.get_height(), '.0f'), 
                    (p.get_x() + p.get_width() / 2., p.get_height()), 
                    ha='center', va='center', 
                    xytext=(0, 5), 
                    textcoords='offset points')
    
    plt.show()

# ps_df is the cleaned Play Store DataFrame
plot_number_category(ps_df)

☘ Let us see what insight we can have on the basis of Size of an app¶

Size vs Rating¶

In [422]:
sns.set_style("whitegrid", {'axes.grid' : False})
sns.lmplot(y='Rating',x='Size',data=ps_df,col="Category", hue="Category",col_wrap=4,line_kws={'color': 'red'})
Out[422]:
<seaborn.axisgrid.FacetGrid at 0x2eb46571a90>
In [423]:
# Get the top 5 categories based on the number of installations
top5_cat = ps_df.groupby('Category')['Installs'].sum().nlargest(5).index.tolist()

# Filter the data for the top 5 categories
data_top5 = ps_df.groupby('Category')['Installs'].sum().loc[top5_cat].reset_index(name='Number_Installations')

# Plotting
plt.figure(figsize=(12, 6));  # Increase figure width
plt.title('Comparing top 5 categories based on Installs', color='white');  # Set title color to white
bar_plot = sns.barplot(y=data_top5['Category'], x=data_top5['Number_Installations'], palette='viridis');

# Annotate each bar with its corresponding value
for index, value in enumerate(data_top5['Number_Installations']):
    plt.text(value, index, f'{value:,}', va='center', fontsize=12, color='white')  # Set text color to white

plt.xlabel('Number of Installations', color='white');  # Set xlabel color to white
plt.ylabel('Category', color='white');  # Set ylabel color to white
plt.gca().set_facecolor('red');  # Set background color to red
plt.grid(axis='x', linestyle='--', alpha=0.7);  # Add gridlines on x-axis

# Adjust x-axis limits to ensure all numbers are visible
plt.xlim(right=data_top5['Number_Installations'].max() * 1.1);  # Extend the limit by 10% on the right side
plt.show();
In [309]:
# Grouping by Content Rating and calculating total installations
data_cont = ps_df.groupby('Content Rating')['Installs'].sum().reset_index(name='Number_Installations');

# Plotting
plt.figure(figsize=(10, 5));
plt.title('Total Installations by Content Rating', color='white');  # Set title color to white
bar_plot = sns.barplot(x=data_cont['Content Rating'], y=data_cont['Number_Installations'], palette='viridis');

# Annotate each bar with its corresponding value
for index, value in enumerate(data_cont['Number_Installations']):
    plt.text(index, value, f'{value:,}', ha='center', fontsize=12, color='red');  # Set text color to red

plt.xlabel('Content Rating', color='white');  # Set xlabel color to white
plt.ylabel('Number of Installations', color='white');  # Set ylabel color to white
plt.gca().set_facecolor('yellow');  # Set background color to yellow
plt.grid(axis='y', linestyle='--', alpha=0.7);  # Add gridlines on y-axis

plt.show();
In [424]:
ps_df.groupby('Content Rating')['Installs'].sum()
Out[424]:
Content Rating
Adults only 18+        2000000
Everyone           52177775851
Everyone 10+        4016271795
Mature 17+          2437986878
Teen               16487275393
Unrated                  50500
Name: Installs, dtype: int64
In [425]:
# Get the top 5 apps based on the number of installations
# (Installs holds bucketed lower bounds, so ties at the top are common and nlargest keeps the first ones it encounters)
data_app = ps_df.groupby('App')['Installs'].sum().nlargest(5).reset_index(name='Number_Installations');
top5_app = data_app['App'].tolist();

# Plotting
plt.figure(figsize=(10, 5));
plt.title('Top 5 Apps by Installations', color='blue');  # Set title color to blue
bar_plot = sns.barplot(x=data_app['Number_Installations'], y=data_app['App'], palette='YlGnBu');  # Set palette to Yellow-Green-Blue

# Annotate each bar with its corresponding value
for index, value in enumerate(data_app['Number_Installations']):
    plt.text(value, index, f'{value:,}', va='center', fontsize=12, color='black');  # Set text color to black

plt.xlabel('Number of Installations', color='blue');  # Set xlabel color to blue
plt.ylabel('App', color='blue');  # Set ylabel color to blue
plt.gca().set_facecolor('yellow');  # Set background color to yellow
plt.grid(axis='x', linestyle='--', alpha=0.7);  # Add gridlines on x-axis

plt.show();

Exploring User_review dataframe¶

In [426]:
# Reading the userreviews.csv file
ur_df=pd.read_csv(r"E:\0001Almabetter\2.numerical python programming\project-eda-numeric-python\User Reviews.csv")
In [427]:
# Checking the first 5 rows of the data

ur_df.head()
Out[427]:
App Translated_Review Sentiment Sentiment_Polarity Sentiment_Subjectivity
0 10 Best Foods for You I like eat delicious food. That's I'm cooking ... Positive 1.00 0.533333
1 10 Best Foods for You This help eating healthy exercise regular basis Positive 0.25 0.288462
2 10 Best Foods for You NaN NaN NaN NaN
3 10 Best Foods for You Works great especially going grocery store Positive 0.40 0.875000
4 10 Best Foods for You Best idea us Positive 1.00 0.300000
In [428]:
ur_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64295 entries, 0 to 64294
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   App                     64295 non-null  object 
 1   Translated_Review       37427 non-null  object 
 2   Sentiment               37432 non-null  object 
 3   Sentiment_Polarity      37432 non-null  float64
 4   Sentiment_Subjectivity  37432 non-null  float64
dtypes: float64(2), object(3)
memory usage: 2.5+ MB
In [429]:
# Checking shape and column in dataframe
print(ur_df.columns)
rows=ur_df.shape[0]
columns=ur_df.shape[1]
print(f"the no of rows is {rows} and no of columns is {columns}")
Index(['App', 'Translated_Review', 'Sentiment', 'Sentiment_Polarity',
       'Sentiment_Subjectivity'],
      dtype='object')
the no of rows is 64295 and no of columns is 5

Let us first define what information the columns contain based on our inspection.

user_reviews dataframe has 64295 rows and 5 columns. The 5 columns are identified as follows:

  • App: Contains the name of the app with a short description (optional).
  • Translated_Review: It contains the English translation of the review dropped by the user of the app.
  • Sentiment: It gives the attitude/emotion of the writer. It can be ‘Positive’, ‘Negative’, or ‘Neutral’.
  • Sentiment_Polarity: It gives the polarity of the review. Its range is [-1,1], where 1 means ‘Positive statement’ and -1 means a ‘Negative statement’.
  • Sentiment_Subjectivity: This value measures how much the review expresses personal opinion rather than factual information. Its range is [0,1]. Values near 1 indicate a highly subjective (opinion-based) review, and values near 0 indicate a review that is mostly factual.
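
The first rows shown earlier suggest how Sentiment relates to Sentiment_Polarity: a polarity of 0 is labelled Neutral and positive polarities are labelled Positive. Assuming the label is simply the sign of the polarity (an assumption about how the dataset was built, not something confirmed by the data description), the mapping could be sketched as follows. `label_sentiment` is a hypothetical helper.

```python
def label_sentiment(polarity: float) -> str:
    """Map a polarity score in [-1, 1] to a sentiment label (assumed rule)."""
    if polarity > 0:
        return 'Positive'
    if polarity < 0:
        return 'Negative'
    return 'Neutral'


for score in (1.0, 0.25, 0.0, -0.5):
    print(score, '->', label_sentiment(score))
```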
In [430]:
def Urinfo():
    temp1=pd.DataFrame(index=ur_df.columns)
    temp1["datatype"]=ur_df.dtypes
    temp1["not null values"]=ur_df.count()
    temp1["null value"]=ur_df.isnull().sum()
    temp1["% of the null value"]=ur_df.isnull().mean().round(4)*100
    temp1["unique count"]=ur_df.nunique()
    return temp1
Urinfo()
Out[430]:
datatype not null values null value % of the null value unique count
App object 64295 0 0.00 1074
Translated_Review object 37427 26868 41.79 27994
Sentiment object 37432 26863 41.78 3
Sentiment_Polarity float64 37432 26863 41.78 5410
Sentiment_Subjectivity float64 37432 26863 41.78 4474

Findings

The null counts are:

  • Translated_Review has 26868 null values, which account for 41.79% of the rows.
  • Sentiment has 26863 null values, which account for 41.78% of the rows.
  • Sentiment_Polarity has 26863 null values, which account for 41.78% of the rows.
  • Sentiment_Subjectivity has 26863 null values, which account for 41.78% of the rows.
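
These percentages follow directly from the null counts and the 64295 total rows. A quick arithmetic check:

```python
TOTAL_ROWS = 64295

null_counts = {
    'Translated_Review': 26868,
    'Sentiment': 26863,
}

for column, nulls in null_counts.items():
    pct = nulls / TOTAL_ROWS * 100
    print(f"{column}: {pct:.2f}% null, {TOTAL_ROWS - nulls} non-null")
```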

Handling the error and NaN values in the User reviews¶

In [431]:
# Finding the total no of NaN values in each column.
ur_df.isnull().sum()
Out[431]:
App                           0
Translated_Review         26868
Sentiment                 26863
Sentiment_Polarity        26863
Sentiment_Subjectivity    26863
dtype: int64

There are a lot of NaN values. We need to analyse these values and see how we can handle them.

In [432]:
# Checking the NaN values in the Translated_Review column
ur_df[ur_df['Translated_Review'].isnull()]
Out[432]:
App Translated_Review Sentiment Sentiment_Polarity Sentiment_Subjectivity
2 10 Best Foods for You NaN NaN NaN NaN
7 10 Best Foods for You NaN NaN NaN NaN
15 10 Best Foods for You NaN NaN NaN NaN
102 10 Best Foods for You NaN NaN NaN NaN
107 10 Best Foods for You NaN NaN NaN NaN
... ... ... ... ... ...
64290 Houzz Interior Design Ideas NaN NaN NaN NaN
64291 Houzz Interior Design Ideas NaN NaN NaN NaN
64292 Houzz Interior Design Ideas NaN NaN NaN NaN
64293 Houzz Interior Design Ideas NaN NaN NaN NaN
64294 Houzz Interior Design Ideas NaN NaN NaN NaN

26868 rows × 5 columns

There are a total of 26868 rows containing NaN values in the Translated_Review column.

We can say that rows which do not have a review (NaN value instead) also have NaN values in the Sentiment, Sentiment_Polarity, and Sentiment_Subjectivity columns in the majority of cases.

Let's check if there are any exceptions.

In [433]:
# The rows corresponding to the NaN values in the translated_review column, where the rest of the columns are non null.
ur_df[ur_df['Translated_Review'].isnull() & ur_df['Sentiment'].notna()]
Out[433]:
App Translated_Review Sentiment Sentiment_Polarity Sentiment_Subjectivity
268 11st NaN Neutral 0.0 0.0
15048 Birds Sounds Ringtones & Wallpapers NaN Neutral 0.0 0.0
22092 Calorie Counter - MyFitnessPal NaN Neutral 0.0 0.0
31623 DC Comics NaN Neutral 0.0 0.0
52500 Garden Photo Frames - Garden Photo Editor NaN Neutral 0.0 0.0

In the few exceptional cases where the remaining columns are non-null although Translated_Review is null, the values appear to be errors. This is because the sentiment, sentiment polarity and sentiment subjectivity of a review can only be determined if there is a corresponding review text.

Hence these values are wrong and can be deleted altogether.
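
One way to see the effect of dropping these rows is the toy frame below (the data is made up and `toy` is a hypothetical name). Calling dropna() with no arguments removes a row if any column is missing, so it also removes the inconsistent rows that have a sentiment but no review text.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Translated_Review': ['Great app', np.nan, np.nan],
    'Sentiment': ['Positive', np.nan, 'Neutral'],  # last row: label without text
    'Sentiment_Polarity': [0.8, np.nan, 0.0],
})

# Rows with a missing review but a non-missing sentiment (the exceptions)
exceptions = toy[toy['Translated_Review'].isnull() & toy['Sentiment'].notna()]

# Dropping any row that contains at least one NaN removes both kinds of bad rows
cleaned = toy.dropna()

print(len(exceptions), len(cleaned))
```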

In [434]:
# Deleting the rows containing NaN values
ur_df = ur_df.dropna()
In [435]:
# The shape of the updated df
ur_df.shape
Out[435]:
(37427, 5)
In [436]:
ur_df.iloc[1:22]
Out[436]:
App Translated_Review Sentiment Sentiment_Polarity Sentiment_Subjectivity
1 10 Best Foods for You This help eating healthy exercise regular basis Positive 0.250000 0.288462
3 10 Best Foods for You Works great especially going grocery store Positive 0.400000 0.875000
4 10 Best Foods for You Best idea us Positive 1.000000 0.300000
5 10 Best Foods for You Best way Positive 1.000000 0.300000
6 10 Best Foods for You Amazing Positive 0.600000 0.900000
8 10 Best Foods for You Looking forward app, Neutral 0.000000 0.000000
9 10 Best Foods for You It helpful site ! It help foods get ! Neutral 0.000000 0.000000
10 10 Best Foods for You good you. Positive 0.700000 0.600000
11 10 Best Foods for You Useful information The amount spelling errors ... Positive 0.200000 0.100000
12 10 Best Foods for You Thank you! Great app!! Add arthritis, eyes, im... Positive 0.750000 0.875000
13 10 Best Foods for You Greatest ever Completely awesome maintain heal... Positive 0.992188 0.866667
14 10 Best Foods for You Good health...... Good health first priority..... Positive 0.550000 0.511111
16 10 Best Foods for You Health It's important world either life . thin... Positive 0.450000 1.000000
17 10 Best Foods for You Mrs sunita bhati I thankful developers,to make... Positive 0.600000 0.666667
18 10 Best Foods for You Very Useful in diabetes age 30. I need control... Positive 0.295000 0.100000
19 10 Best Foods for You One greatest apps. Positive 1.000000 1.000000
20 10 Best Foods for You good nice Positive 0.650000 0.800000
21 10 Best Foods for You Healthy Really helped Positive 0.350000 0.350000
22 10 Best Foods for You God health Neutral 0.000000 0.000000
23 10 Best Foods for You HEALTH SHOULD ALWAYS BE TOP PRIORITY. !!. ON M... Positive 0.781250 0.500000
24 10 Best Foods for You An excellent A useful Positive 0.650000 0.500000

There are a total of 37427 rows in the updated df.

Hence we have taken care of all the NaN values in the df.

Let's inspect the updated df.

In [437]:
# Inspecting the sentiment column
ur_df['Sentiment'].value_counts()
Out[437]:
Positive    23998
Negative     8271
Neutral      5158
Name: Sentiment, dtype: int64

The values in the Sentiment_Polarity and Sentiment_Subjectivity columns look correct.

On the given datasets, we successfully developed a data pipeline. We can now examine this data flow and create user-friendly visuals. It is easy to compare different measures using the visualizations, and thus to draw implications from them.

Data Visualization on play store data:¶

We have successfully cleaned the dirty data. Now we can perform some data visualization and come up with insights on the given datasets.

1). Correlation Heatmap¶

In [438]:
# Finding correlation between different columns in the play store data
ps_df.select_dtypes(include='number').corr()  # restrict to numeric columns explicitly
Out[438]:
Rating Reviews Size Installs Price
Rating 1.000000 0.050212 0.037378 0.034306 -0.018674
Reviews 0.050212 1.000000 0.066152 0.625158 -0.007603
Size 0.037378 0.066152 1.000000 0.030474 -0.019589
Installs 0.034306 0.625158 0.030474 1.000000 -0.009412
Price -0.018674 -0.007603 -0.019589 -0.009412 1.000000
In [439]:
# Heat map for play_store
plt.figure(figsize = (20,10))
sns.heatmap(ps_df.select_dtypes(include='number').corr(), annot= True)
plt.title('Correlation Heatmap for Playstore Data', size=20)
Out[439]:
Text(0.5, 1.0, 'Correlation Heatmap for Playstore Data')
  • There is a fairly strong positive correlation (about 0.63) between the Reviews and Installs columns. This is expected: the higher the number of installs, the larger the user base, and the more reviews users leave.
  • The Price is very weakly negatively correlated with the Rating, Reviews, and Installs (all coefficients are between -0.02 and 0). Any tendency for ratings, reviews or installs to fall as price rises is therefore negligible in this data.
  • The Rating is very weakly positively correlated with the Installs and Reviews columns (about 0.03 and 0.05). Higher-rated apps have only marginally more installs and reviews, so rating alone says little about them.
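
To make the strength of these relationships more concrete, the Pearson coefficient can be computed by hand and squared to give the share of variance that a linear fit would explain. The sketch below uses a tiny made-up sample (`installs` and `reviews` are hypothetical values) and then applies the same squaring to two coefficients from the correlation table above.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: reviews grow with installs, but not perfectly
installs = [100, 1_000, 10_000, 100_000]
reviews = [3, 40, 150, 4_000]
print(round(pearson_r(installs, reviews), 3))

# r = 0.625 (Reviews vs Installs) explains roughly 39% of the variance,
# while r = -0.019 (Price vs Rating) explains well under 1%
print(round(0.625158 ** 2, 3), round((-0.018674) ** 2, 5))
```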

Let us check if there is any correlation between the two dataframes.¶

In [440]:
merged_df = pd.merge(ps_df, ur_df, on='App', how = "inner")
In [441]:
merged_df.shape
Out[441]:
(35929, 17)
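
The row count of the merged frame reflects how an inner join works: each app row is repeated once per matching review, and apps that appear in only one of the two frames are dropped. A toy illustration (the frames `apps` and `reviews` are made up for this sketch):

```python
import pandas as pd

apps = pd.DataFrame({'App': ['A', 'B', 'C'], 'Rating': [4.5, 3.9, 4.1]})
reviews = pd.DataFrame({
    'App': ['A', 'A', 'B', 'D'],  # 'C' has no reviews, 'D' has no app row
    'Sentiment': ['Positive', 'Negative', 'Positive', 'Neutral'],
})

merged = pd.merge(apps, reviews, on='App', how='inner')

# One row per (app, review) pair whose App value is present in both frames
print(merged.shape)
print(merged['App'].tolist())
```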
In [442]:
def merged_dfinfo():
    temp = pd.DataFrame(index=merged_df.columns)
    temp['data_type'] = merged_df.dtypes
    temp["count of non null values"] = merged_df.count()
    temp['NaN values'] = merged_df.isnull().sum()
    temp['% NaN values'] = merged_df.isnull().mean() * 100  # fraction converted to a percentage
    temp['unique_count'] = merged_df.nunique() 
    return temp
merged_dfinfo()
Out[442]:
data_type count of non null values NaN values % NaN values unique_count
App object 35929 0 0.0 816
Category object 35929 0 0.0 33
Rating float64 35929 0 0.0 22
Reviews int32 35929 0 0.0 807
Size float64 35929 0 0.0 166
Installs int64 35929 0 0.0 12
Type object 35929 0 0.0 2
Price float64 35929 0 0.0 9
Content Rating object 35929 0 0.0 5
Genres object 35929 0 0.0 67
Last Updated datetime64[ns] 35929 0 0.0 247
Current Ver object 35929 0 0.0 498
Android Ver object 35929 0 0.0 22
Translated_Review object 35929 0 0.0 26682
Sentiment object 35929 0 0.0 3
Sentiment_Polarity float64 35929 0 0.0 5295
Sentiment_Subjectivity float64 35929 0 0.0 4382
In [443]:
merged_df.select_dtypes(include='number').corr()  # restrict to numeric columns explicitly
Out[443]:
Rating Reviews Size Installs Price Sentiment_Polarity Sentiment_Subjectivity
Rating 1.000000 0.075736 0.091094 0.020145 -0.010055 0.092903 0.068758
Reviews 0.075736 1.000000 0.190686 0.564256 -0.020591 -0.080021 -0.009315
Size 0.091094 0.190686 1.000000 0.040817 0.002484 -0.118398 0.013460
Installs 0.020145 0.564256 0.040817 1.000000 -0.025213 -0.057842 -0.006307
Price -0.010055 -0.020591 0.002484 -0.025213 1.000000 0.024148 0.003182
Sentiment_Polarity 0.092903 -0.080021 -0.118398 -0.057842 0.024148 1.000000 0.259668
Sentiment_Subjectivity 0.068758 -0.009315 0.013460 -0.006307 0.003182 0.259668 1.000000
In [444]:
# Correlation heatmap
# Heat Map for the merged data frame
plt.figure(figsize = (15,10))
sns.heatmap(merged_df.select_dtypes(include='number').corr(), annot= True, cmap='Greens')
plt.title(' Heatmap for merged Dataframe', size=20)
Out[444]:
Text(0.5, 1.0, ' Heatmap for merged Dataframe')
In [445]:
merged_df = merged_df.dropna(subset=['Sentiment', 'Translated_Review'])
In [446]:
merged_df.shape
Out[446]:
(35929, 17)
In [447]:
merged_df.isna().sum()
Out[447]:
App                       0
Category                  0
Rating                    0
Reviews                   0
Size                      0
Installs                  0
Type                      0
Price                     0
Content Rating            0
Genres                    0
Last Updated              0
Current Ver               0
Android Ver               0
Translated_Review         0
Sentiment                 0
Sentiment_Polarity        0
Sentiment_Subjectivity    0
dtype: int64

2). What is the ratio of number of Paid apps and Free apps?¶

In [448]:
data = ps_df['Type'].value_counts() 
labels = ['Free', 'Paid']

# create pie chart
plt.figure(figsize=(10,10))
colors = ["#00EE76","#7B8895"]
explode=(0.01,0.1)
plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%',explode=explode,textprops={'fontsize': 15})
plt.title('Distribution of Paid and Free apps',size=15,loc='center')
plt.legend()
Out[448]:
<matplotlib.legend.Legend at 0x2eb4cbc5fd0>

Findings:

From the above graph we can see that about 92% of the apps in the Google Play Store dataset are free and about 8% are paid.
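
These shares can be checked against the row counts shown earlier for Price == 0 (8896 rows) and Price != 0 (753 rows), assuming that every free app has a price of 0:

```python
free_apps = 8896  # rows where Price == 0
paid_apps = 753   # rows where Price != 0
total = free_apps + paid_apps

free_pct = free_apps / total * 100
paid_pct = paid_apps / total * 100

print(f"total={total}, free={free_pct:.1f}%, paid={paid_pct:.1f}%")
```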

In [449]:
ps_df['Content Rating'].unique()
Out[449]:
array(['Everyone', 'Teen', 'Everyone 10+', 'Mature 17+',
       'Adults only 18+', 'Unrated'], dtype=object)

3). Which Content Rating category has the most apps on the Play Store?¶

In [450]:
# Content rating of the apps
data = ps_df['Content Rating'].value_counts()
# value_counts() sorts by frequency, so take the labels from its index to keep slices and labels aligned
labels = data.index

#create pie chart
plt.figure(figsize=(10,10))
explode=(0,0.1,0.1,0.1,0.0,1.3)
colors = ['C4', 'r', 'c', 'g', 'm', 'k']
plt.pie(data, labels = labels, colors = colors, autopct='%.2f%%',explode=explode,textprops={'fontsize': 15})
plt.title('Content Rating',size=20,loc='center')
plt.legend()
Out[450]:
<matplotlib.legend.Legend at 0x2eb4c744580>

A majority of the apps (82%) in the Play Store can be used by everyone. The remaining apps carry various age restrictions.

4). What are the top categories on the Google Play Store?¶

In [451]:
ps_df.groupby("Category")["App"].count().sort_values(ascending= False)
Out[451]:
Category
FAMILY                 1829
GAME                    959
TOOLS                   825
BUSINESS                420
MEDICAL                 395
PERSONALIZATION         374
PRODUCTIVITY            374
LIFESTYLE               369
FINANCE                 345
SPORTS                  325
COMMUNICATION           315
HEALTH_AND_FITNESS      288
PHOTOGRAPHY             281
NEWS_AND_MAGAZINES      254
SOCIAL                  239
BOOKS_AND_REFERENCE     221
TRAVEL_AND_LOCAL        219
SHOPPING                202
DATING                  171
VIDEO_PLAYERS           163
MAPS_AND_NAVIGATION     131
EDUCATION               119
FOOD_AND_DRINK          112
ENTERTAINMENT           102
AUTO_AND_VEHICLES        85
LIBRARIES_AND_DEMO       83
WEATHER                  79
HOUSE_AND_HOME           74
EVENTS                   64
ART_AND_DESIGN           63
PARENTING                60
COMICS                   56
BEAUTY                   53
Name: App, dtype: int64
In [452]:
# Category counts, sorted from most to least common
category_counts = ps_df['Category'].value_counts()
x_list = category_counts.tolist()        # number of apps per category
y_list = category_counts.index.tolist()  # category names
In [453]:
#Number of apps belonging to each category in the playstore
plt.figure(figsize=(20,10))
graph = sns.barplot(x = y_list, y = x_list, palette= "tab10")
# Categories are on the x-axis and app counts on the y-axis
plt.xlabel('App Categories', size=15)
plt.ylabel('Number of Apps', size=15)
graph.set_title("Top categories on Playstore", fontsize = 25)
graph.set_xticklabels(graph.get_xticklabels(), rotation= 45, horizontalalignment='right');

Findings:

There are 33 categories in total in the dataset. From the above output we can conclude that most of the apps on the Play Store fall under the FAMILY and GAME categories, while the fewest are in the BEAUTY and COMICS categories.

In [454]:
# Percentage of apps belonging to each category in the playstore
plt.figure(figsize=(18,18))
plt.pie(ps_df.Category.value_counts(), labels=ps_df.Category.value_counts().index, autopct='%1.2f%%')
my_circle = plt.Circle( (0,0), 0.50, color='white')
p=plt.gcf()
p.gca().add_artist(my_circle)
plt.title('% of apps share in each Category', fontsize = 25)
plt.show()

5). Which app category has the most installs?¶

In [455]:
# total app installs in each category of the play store

a = ps_df.groupby(['Category'])['Installs'].sum().sort_values()
a.plot.barh(figsize=(15,10), color = 'c')
# barh puts the categories on the y-axis and the install totals on the x-axis
plt.xlabel('Total app Installs', fontsize = 15)
plt.ylabel('App Categories', fontsize = 15)
plt.title('Total app installs in each category', fontsize = 20)
Out[455]:
Text(0.5, 1.0, 'Total app installs in each category')

Findings:

This tells us which categories of apps have the most installs. The Game, Communication and Tools categories have the highest number of installs compared to other categories of apps.

6). Average rating of the apps¶

In [456]:
# Average app ratings

ps_df['Rating'].value_counts().plot.bar(figsize=(20,8), color = 'm' )
plt.xlabel('Average rating',fontsize = 15 )
plt.ylabel('Number of apps', fontsize = 15)
plt.title('Average rating of apps in Playstore', fontsize = 20)
plt.legend()
Out[456]:
<matplotlib.legend.Legend at 0x2eb4cd86e20>

We can represent the ratings in a better way if we group the ratings between certain intervals. Here, we can group the rating as follows:

  • 4-5: Top rated
  • 3-4: Above average
  • 2-3: Average
  • 1-2: Below average

Let's create a new column Rating_group in the main dataframe and apply these groups.

In [457]:
# Defining a function Rating_app to group the ratings as mentioned above
def Rating_app(val):
  '''
  This function categorizes a rating from 1 to 5
  as Top rated, Above Average, Average or Below Average.
  '''
  if val>=4:
    return 'Top rated'
  elif val>=3:   # >= so that a rating of exactly 3.0 is not sent to 'Below Average'
    return 'Above Average'
  elif val>=2:   # >= so that a rating of exactly 2.0 counts as 'Average'
    return 'Average'
  else:
    return 'Below Average'

Let's apply the Rating_app function on the Rating column and save the output in a new column named Rating_group in the main df.

In [458]:
# Applying the Rating_app function
ps_df['Rating_group']=ps_df['Rating'].apply(Rating_app)
In [459]:
# Average app ratings 
ps_df['Rating_group'].value_counts().plot.bar(figsize=(15,5), color = 'royalblue')
plt.xlabel('Rating Group', fontsize = 12)
plt.ylabel('Number of apps', fontsize = 12)
plt.title('Average app ratings', fontsize = 18)
plt.xticks(rotation=0)
plt.legend()
Out[459]:
<matplotlib.legend.Legend at 0x2eb4e8a9a00>
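
The rating groups listed in this section can also be built without a hand-written function. The sketch below uses `pd.cut` on a few made-up ratings. The sample values, the bin edges and the choice to put a boundary value such as 3.0 into the higher group are my assumptions, not something the dataset prescribes:

```python
import pandas as pd

# Hypothetical ratings; ps_df['Rating'] would be used in the notebook.
ratings = pd.Series([4.5, 4.0, 3.9, 3.0, 2.6, 1.0])

# right=False makes each bin closed on the left: [1, 2), [2, 3), [3, 4), [4, 5.01).
# The last edge is nudged above 5 so that a rating of exactly 5.0 is included.
bins = [1, 2, 3, 4, 5.01]
labels = ['Below Average', 'Average', 'Above Average', 'Top rated']
rating_group = pd.cut(ratings, bins=bins, labels=labels, right=False)
print(rating_group.tolist())
```

One advantage of this form is that the bin edges sit in one list, so it is easy to see which group a boundary rating ends up in.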

7). What are the Top 10 installed apps in any category?¶

In [460]:
def findtop10incategory(category):
    # Plot the 10 most installed apps of the given category (the name is case-insensitive)
    category = category.upper()
    top10 = ps_df[ps_df['Category'] == category]
    top10apps = top10.sort_values(by='Installs', ascending=False).head(10)
    plt.figure(figsize=(15,6), dpi=100)
    plt.title('Top 10 Installed Apps',size = 20)  
    graph = sns.barplot(x = top10apps.App, y = top10apps.Installs, palette= "icefire")
    graph.set_xticklabels(graph.get_xticklabels(), rotation= 45, horizontalalignment='right')
In [461]:
findtop10incategory('GAME')

Findings:

From the above graph we can see that in the Game category, Subway Surfers, Candy Crush Saga and Temple Run 2 have the highest installs. In the same way, by passing a different category name to the function, we can get the top 10 installed apps of any category.

8). Top apps that are of free type.¶

In [462]:
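# Creating a df for only free apps

# .copy() gives an independent frame, so new columns can be added to it later without a SettingWithCopyWarning
free_df = ps_df[ps_df['Type'] == 'Free'].copy()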
In [463]:
# Creating a df for top free apps

top_free_df = free_df[free_df['Installs'] == free_df['Installs'].max()]
top10free_apps=top_free_df.nlargest(10, 'Installs', keep='first')
top10free_apps.head(10)
Out[463]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver Rating_group
152 Google Play Books BOOKS_AND_REFERENCE 3.9 1433233 12.0 1000000000 Free 0.0 Teen Books & Reference 2018-08-03 Varies with device Varies with device Above Average
335 Messenger – Text and Video Chat for Free COMMUNICATION 4.0 56642847 12.0 1000000000 Free 0.0 Everyone Communication 2018-08-01 Varies with device Varies with device Top rated
336 WhatsApp Messenger COMMUNICATION 4.4 69119316 12.0 1000000000 Free 0.0 Everyone Communication 2018-08-03 Varies with device Varies with device Top rated
338 Google Chrome: Fast & Secure COMMUNICATION 4.3 9642995 12.0 1000000000 Free 0.0 Everyone Communication 2018-08-01 Varies with device Varies with device Top rated
340 Gmail COMMUNICATION 4.3 4604324 12.0 1000000000 Free 0.0 Everyone Communication 2018-08-02 Varies with device Varies with device Top rated
341 Hangouts COMMUNICATION 4.0 3419249 12.0 1000000000 Free 0.0 Everyone Communication 2018-07-21 Varies with device Varies with device Top rated
391 Skype - free IM & video calls COMMUNICATION 4.1 10484169 12.0 1000000000 Free 0.0 Everyone Communication 2018-08-03 Varies with device Varies with device Top rated
865 Google Play Games ENTERTAINMENT 4.3 7165362 12.0 1000000000 Free 0.0 Teen Entertainment 2018-07-16 Varies with device Varies with device Top rated
1654 Subway Surfers GAME 4.5 27722264 76.0 1000000000 Free 0.0 Everyone 10+ Arcade 2018-07-12 1.90.0 4.1 and up Top rated
2544 Facebook SOCIAL 4.1 78158306 12.0 1000000000 Free 0.0 Teen Social 2018-08-03 Varies with device Varies with device Top rated
In [464]:
# Top free apps

top_free_df['App']
Out[464]:
152                            Google Play Books
335     Messenger – Text and Video Chat for Free
336                           WhatsApp Messenger
338                 Google Chrome: Fast & Secure
340                                        Gmail
341                                     Hangouts
391                Skype - free IM & video calls
865                            Google Play Games
1654                              Subway Surfers
2544                                    Facebook
2545                                   Instagram
2554                                     Google+
2808                               Google Photos
3117                   Maps - Navigate & Explore
3127                          Google Street View
3234                                      Google
3454                                Google Drive
3665                                     YouTube
3687                     Google Play Movies & TV
3736                                 Google News
Name: App, dtype: object
In [465]:
# Categories in which the top 20 free apps belong to
top_free_df['Category'].value_counts().plot.bar(figsize=(20,6), color= ('darkcyan','blueviolet'))
plt.xlabel('Category', size=15)
plt.ylabel('Number of apps', size=15)
plt.title('Categories in which the top 20 free apps belong', size=19)
plt.xticks(rotation=45)
plt.legend()
Out[465]:
<matplotlib.legend.Legend at 0x2eb4f27c5e0>

9). Top apps that are of paid type.¶

In [466]:
# Creating a df containing only paid apps (copied, because new columns are added to it later)
paid_df=ps_df[ps_df['Type']=='Paid'].copy()
In [467]:
# Number of apps that can be installed at a particular price 

paid_df.groupby('Price')['App'].count().sort_values(ascending= False).plot.bar(figsize = (20,6), color = 'crimson')
Out[467]:
<AxesSubplot:xlabel='Price'>
  • Paid apps charge users a certain amount to download and install the app. This amount varies from one app to another.
  • Many apps charge a small amount, whereas some charge much more. Here the price to download an app varies from USD 0.99 to USD 400.
  • To select the top paid apps, it would not be fair to look only at the number of installs, because apps with a lower installation fee will generally be installed by more people.
  • A better way to determine the top apps in the paid category is the revenue each app generated through installs.
  • This is given by:

    Revenue generated through installs = (Number of installs)x(Price to install the app)
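
To make the formula concrete, here is a small worked sketch on two rows. The installs and prices are the ones shown for these apps in the paid_df outputs of this notebook, while `toy_paid` itself is just an illustrative frame. Since the install counts look like rounded brackets, the result is better read as an estimate than as exact revenue:

```python
import pandas as pd

# Two paid apps with the Installs and Price values shown in this notebook's outputs.
toy_paid = pd.DataFrame({
    'App': ['Minecraft', 'Puffin Browser Pro'],
    'Installs': [10_000_000, 100_000],
    'Price': [6.99, 3.99],
})

# Revenue through installs = number of installs x price to install the app.
toy_paid['Revenue'] = toy_paid['Installs'] * toy_paid['Price']

# idxmax() returns the row label of the largest revenue.
top_app = toy_paid.loc[toy_paid['Revenue'].idxmax(), 'App']
print(toy_paid)
print(top_app)
```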

Let's define a new column Revenue in paid_df which gives the revenue generated by the app through installs alone.

In [468]:
# Creating a new column 'Revenue' in paid_df
paid_df['Revenue'] = paid_df['Installs']*paid_df['Price']
paid_df.head()
Out[468]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver Rating_group Revenue
234 TurboScan: scan documents and receipts in PDF BUSINESS 4.7 11442 6.8 100000 Paid 4.99 Everyone Business 2018-03-25 1.5.2 4.0 and up Top rated 499000.0
235 Tiny Scanner Pro: PDF Doc Scan BUSINESS 4.8 10295 39.0 100000 Paid 4.99 Everyone Business 2017-04-11 3.4.6 3.0 and up Top rated 499000.0
427 Puffin Browser Pro COMMUNICATION 4.0 18247 12.0 100000 Paid 3.99 Everyone Communication 2018-07-05 7.5.3.20547 4.1 and up Top rated 399000.0
476 Moco+ - Chat, Meet People DATING 4.2 1545 12.0 10000 Paid 3.99 Mature 17+ Dating 2018-06-19 2.6.139 4.1 and up Top rated 39900.0
477 Calculator DATING 2.6 57 6.2 1000 Paid 6.99 Everyone Dating 2017-10-25 1.1.6 4.0 and up Average 6990.0
In [469]:
# Top app in the paid category

paid_df[paid_df['Revenue'] == paid_df['Revenue'].max()]
Out[469]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver Rating_group Revenue
2241 Minecraft FAMILY 4.5 2376564 12.0 10000000 Paid 6.99 Everyone 10+ Arcade;Action & Adventure 2018-07-24 1.5.2.1 Varies with device Top rated 69900000.0
In [470]:
# Top 10 paid apps in the play store
top10paid_apps=paid_df.nlargest(10, 'Revenue', keep='first')
top10paid_apps['App']
Out[470]:
2241                        Minecraft
5351                        I am rich
5356                I Am Rich Premium
4034                    Hitman Sniper
7417    Grand Theft Auto: San Andreas
2883              Facetune - For Free
5578          Sleep as Android Unlock
8804              DraStic DS Emulator
4367         I'm Rich - Trump Edition
4362                       💎 I'm rich
Name: App, dtype: object
In [471]:
# Categories in which the top 10 paid apps belong to
top10paid_apps['Category'].value_counts().plot.bar(figsize=(15,5), color= ["orange", "red", "green", "blue", "purple"])
plt.xlabel('Category',size=12)
plt.ylabel('Number of apps',size=12)
plt.title('Categories in which the top 10 paid apps belong', size=15)
plt.xticks(rotation=0)
plt.legend()
Out[471]:
<matplotlib.legend.Legend at 0x2eb4f4c0970>
In [472]:
# Top paid apps according to the revenue generated through installs alone
top10paid_apps.groupby('App')['Revenue'].mean().sort_values(ascending= True).plot.barh(figsize=(16,10), color='darkorange')
plt.xlabel('Revenue Generated (USD)', size=15)
plt.title('Top apps based on revenue generated through installation fee', size=20)
plt.legend()
Out[472]:
<matplotlib.legend.Legend at 0x2eb4f4c0640>
In [473]:
# Paid app with the highest revenue (same filter as In [469]; it does not rank by installs)
paid_df[paid_df['Revenue'] == paid_df['Revenue'].max()]
Out[473]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver Rating_group Revenue
2241 Minecraft FAMILY 4.5 2376564 12.0 10000000 Paid 6.99 Everyone 10+ Arcade;Action & Adventure 2018-07-24 1.5.2.1 Varies with device Top rated 69900000.0

10). Distribution of apps based on their size¶

In [474]:
# Values calculated earlier
[mean_size,median_size,max_size,min_size]
Out[474]:
[20.4136, 12.0, 100.0, 0.0083]
  • The size of an app in our dataset varies from 0.0083 MB to 100 MB. We can analyse app sizes more easily if we group them into intervals.

  • We have already established that the numeric values in the 'Size' column are right-skewed: most apps are small, and the mean (about 20.4 MB) sits well above the median (12 MB).

  • Let's group the data in the Size column into intervals of 10 MB each, as follows:

(< 1 MB, 1-10, 10-20, 20-30, ..., 90-100, 'Varies with device')

Let's create a function to build the size intervals.

In [476]:
# Function to group the apps based on its size in MB

def size_apps(var):
  '''
  This function groups the size of an app 
  between ~0 to 100 MB into certain intervals.
  Values that cannot be compared with a number are returned unchanged.
  '''
  try:
    if var < 1:
      return 'Below 1'
    elif var < 10:
      return '1-10'
    elif var < 20:
      return '10-20'
    elif var < 30:
      return '20-30'
    elif var < 40:
      return '30-40'
    elif var < 50:
      return '40-50'
    elif var < 60:
      return '50-60'
    elif var < 70:
      return '60-70'
    elif var < 80:
      return '70-80'
    elif var < 90:
      return '80-90'
    else:
      return '90 and above'
  except TypeError:
    # e.g. a string value, which cannot be compared with a number
    return var
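
Because every interval is 10 MB wide, the label can also be computed instead of listed branch by branch. The following is only a sketch of that idea. `size_bucket` is a hypothetical helper, not part of the notebook, and it expects numeric input only (it does not pass strings through):

```python
def size_bucket(size_mb):
    """Return the interval label for an app size in MB (hypothetical helper)."""
    if size_mb < 1:
        return 'Below 1'
    if size_mb >= 90:
        return '90 and above'
    low = int(size_mb // 10) * 10  # lower edge of the 10 MB bucket
    # Sizes from 1 up to 10 MB share the first bucket, labelled '1-10'.
    return '1-10' if low == 0 else f'{low}-{low + 10}'

for size in [0.0083, 8.7, 19.0, 25.0, 100.0]:
    print(size, '->', size_bucket(size))
```

The labels are the same strings that the longer function returns, so either version could feed the same plots.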

Let's apply the size_apps function on the Size column and store the results in a new column named size_group.

In [477]:
ps_df['size_group']=ps_df['Size'].apply(lambda x : size_apps(x))
ps_df.head()
Out[477]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver Rating_group size_group
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19.0 10000 Free 0.0 Everyone Art & Design 2018-01-07 1.0.0 4.0.3 and up Top rated 10-20
1 Coloring book moana ART_AND_DESIGN 3.9 967 14.0 500000 Free 0.0 Everyone Art & Design;Pretend Play 2018-01-15 2.0.0 4.0.3 and up Above Average 10-20
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7 5000000 Free 0.0 Everyone Art & Design 2018-08-01 1.2.4 4.0.3 and up Top rated 1-10
3 Sketch - Draw & Paint ART_AND_DESIGN 4.5 215644 25.0 50000000 Free 0.0 Teen Art & Design 2018-06-08 Varies with device 4.2 and up Top rated 20-30
4 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3 967 2.8 100000 Free 0.0 Everyone Art & Design;Creativity 2018-06-20 1.1 4.4 and up Top rated 1-10
In [478]:
# no of apps belonging to each size group
ps_df['size_group'].value_counts().plot.barh(figsize=(20,8),color='r').invert_yaxis()
plt.title("Number of apps in different size groups", size=20)
plt.ylabel('App size in MB', size=15)
plt.xlabel('No of apps', size=15)
plt.legend()
Out[478]:
<matplotlib.legend.Legend at 0x2eb4f57bb20>
In [479]:
# average no of user reviews in each size group
ps_df.groupby('size_group')['Reviews'].mean().sort_values().plot.barh(figsize=(20,8), color = 'green')
plt.title("Average number of user reviews (in millions)", size=20)
plt.xlabel('Average no of user reviews', size=15)
plt.ylabel('App size in MB', size=15)
plt.legend()
Out[479]:
<matplotlib.legend.Legend at 0x2eb4f5dcf40>
In [480]:
# average number of app installs in each size group

ps_df.groupby('size_group')['Installs'].mean().sort_values(ascending= False).plot.barh(figsize=(20,8),color='sandybrown').invert_yaxis()
plt.title("Average number of app installs (In 10 millions)", size=20)
plt.ylabel('App size in MB', size=15)
plt.xlabel('Average no of app installs',  size=15)
plt.legend()
Out[480]:
<matplotlib.legend.Legend at 0x2eb4f61e340>
  • The majority of the apps are between 1 and 20 MB in size.
  • There are a good number of apps whose size varies with the device.
  • On average, the smaller apps have fewer installs and fewer user reviews.

11). Android version based on each category¶

Next, I group the 'Android Ver' values into major Android versions 1 to 8. Null values and 'Varies with device' are set to 1.0.

In [481]:
ps_df['Android Ver'].replace(to_replace=['4.4W and up','Varies with device'], value=['4.4','1.0'],inplace=True)
ps_df['Android Ver'].replace({k: '1.0' for k in ['1.0','1.0 and up','1.5 and up','1.6 and up']},inplace=True)
ps_df['Android Ver'].replace({k: '2.0' for k in ['2.0 and up','2.0.1 and up','2.1 and up','2.2 and up','2.2 - 7.1.1','2.3 and up','2.3.3 and up']},inplace=True)
ps_df['Android Ver'].replace({k: '3.0' for k in ['3.0 and up','3.1 and up','3.2 and up']},inplace=True)
ps_df['Android Ver'].replace({k: '4.0' for k in ['4.0 and up','4.0.3 and up','4.0.3 - 7.1.1','4.1 and up','4.1 - 7.1.1','4.2 and up','4.3 and up','4.4','4.4 and up']},inplace=True)
ps_df['Android Ver'].replace({k: '5.0' for k in ['5.0 - 6.0','5.0 - 7.1.1','5.0 - 8.0','5.0 and up','5.1 and up']},inplace=True)
ps_df['Android Ver'].replace({k: '6.0' for k in ['6.0 and up']},inplace=True)
ps_df['Android Ver'].replace({k: '7.0' for k in ['7.0 - 7.1.1','7.0 and up','7.1 and up']},inplace=True)
ps_df['Android Ver'].replace({k: '8.0' for k in ['8.0 and up']},inplace=True)
ps_df['Android Ver'].fillna('1.0', inplace=True)
In [482]:
print(ps_df.groupby('Category')['Android Ver'].value_counts())
Type_cat = ps_df.groupby('Category')['Android Ver'].value_counts().unstack().plot.bar(figsize=(25,8), width=2)
plt.xticks()
plt.show()
Category        Android Ver
ART_AND_DESIGN  4.0            50
                2.0             9
                3.0             2
                1.0             1
                5.0             1
                               ..
WEATHER         4.0            38
                1.0            23
                2.0            10
                5.0             7
                3.0             1
Name: Android Ver, Length: 199, dtype: int64

Findings:

It is evident from the above plot that the majority of the apps require Android 4.0 and up.
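
The version grouping used in this section could also be sketched with a regular expression instead of one replace call per major version. This is a hedged alternative, assuming every version string starts with its major version number. The sample strings in `raw_ver` are a small hypothetical subset of the column:

```python
import pandas as pd

# Hypothetical sample of raw 'Android Ver' strings, including a missing value.
raw_ver = pd.Series(['4.0.3 and up', '4.4W and up', '2.2 - 7.1.1',
                     'Varies with device', '8.0 and up', None])

# Capture the leading major version number; rows without one become missing.
major = raw_ver.str.extract(r'^(\d+)', expand=False)

# Same convention as the cell above: unknown or device-dependent versions map to 1.0.
android_major = major.fillna('1') + '.0'
print(android_major.tolist())
```

Mapping 'Varies with device' to 1.0 follows the notebook's choice. It does make the 1.0 group larger than the number of apps that really target Android 1.x.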

Data Visualization on User Reviews:¶

1). Percentage of Review Sentiments¶

In [483]:
# Basic inspection
ur_df.columns
Out[483]:
Index(['App', 'Translated_Review', 'Sentiment', 'Sentiment_Polarity',
       'Sentiment_Subjectivity'],
      dtype='object')
In [484]:
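import matplotlib
# Take counts and labels from the same value_counts() result so they stay aligned
sentiment_counts = ur_df['Sentiment'].value_counts()
counts = list(sentiment_counts)
labels = [f'{s} Reviews' for s in sentiment_counts.index]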
matplotlib.rcParams['font.size'] = 20
matplotlib.rcParams['figure.figsize'] = (10, 15)
plt.pie(counts, labels=labels, explode=[0.01, 0.05, 0.05], shadow=True, autopct="%.2f%%")
plt.title('Percentage of Review Sentiments', fontsize=20)
plt.axis('off')
plt.legend(bbox_to_anchor=(0.9, 0, 0.5, 1))
plt.show()

Findings:

  1. Positive reviews are 64.30%
  2. Negative reviews are 22.80%
  3. Neutral reviews are 12.90%

2). Apps with the highest number of positive reviews¶

In [485]:
# positive reviews
positive_ur_df=ur_df[ur_df['Sentiment']=='Positive']
positive_ur_df
Out[485]:
App Translated_Review Sentiment Sentiment_Polarity Sentiment_Subjectivity
0 10 Best Foods for You I like eat delicious food. That's I'm cooking ... Positive 1.000000 0.533333
1 10 Best Foods for You This help eating healthy exercise regular basis Positive 0.250000 0.288462
3 10 Best Foods for You Works great especially going grocery store Positive 0.400000 0.875000
4 10 Best Foods for You Best idea us Positive 1.000000 0.300000
5 10 Best Foods for You Best way Positive 1.000000 0.300000
... ... ... ... ... ...
64217 Housing-Real Estate & Property I able set range 1cr, scroll space 0-1cr range... Positive 0.233333 0.550000
64221 Housing-Real Estate & Property Everything old stuff neither clear sold proper... Positive 0.021591 0.259470
64222 Housing-Real Estate & Property Most ads older many agents ..not much owner po... Positive 0.173333 0.486667
64223 Housing-Real Estate & Property If photos posted portal load, fit purpose. I'm... Positive 0.225000 0.447222
64227 Housing-Real Estate & Property I property business got link SMS happy perform... Positive 0.800000 1.000000

23998 rows × 5 columns

In [486]:
# Every row here is a positive review, so counting rows per app is enough
positive_ur_df['App'].value_counts().nlargest(10).plot.barh(figsize=(10,8),color='seagreen').invert_yaxis()
plt.title("Top 10 positive review apps")
plt.xlabel('Total number of positive reviews')
plt.legend()
Out[486]:
<matplotlib.legend.Legend at 0x2eb50c13730>

3). Apps with the highest number of negative reviews.¶

In [487]:
negative_ur_df=ur_df[ur_df['Sentiment']=='Negative']
negative_ur_df
Out[487]:
App Translated_Review Sentiment Sentiment_Polarity Sentiment_Subjectivity
32 10 Best Foods for You No recipe book Unable recipe book. Negative -0.500000 0.500000
43 10 Best Foods for You Waste time It needs internet time n ask calls ... Negative -0.200000 0.000000
68 10 Best Foods for You Faltu plz waste ur time Negative -0.200000 0.000000
85 10 Best Foods for You Crap Doesn't work Negative -0.800000 0.800000
95 10 Best Foods for You Boring. I thought actually just texts that's i... Negative -0.325000 0.475000
... ... ... ... ... ...
64215 Housing-Real Estate & Property Horrible app. I wanted list property get aroun... Negative -0.528571 0.717262
64216 Housing-Real Estate & Property Worst app. We get nothing Time waste . They up... Negative -0.400000 0.250000
64220 Housing-Real Estate & Property No response support team. After I login, unabl... Negative -0.377778 0.533333
64226 Housing-Real Estate & Property Dumb app, I wanted post property rent give opt... Negative -0.287500 0.250000
64230 Housing-Real Estate & Property Useless app, I searched flats kondapur, Hydera... Negative -0.316667 0.400000

8271 rows × 5 columns

In [488]:
# Every row here is a negative review, so counting rows per app is enough
negative_ur_df['App'].value_counts().nlargest(10).plot.barh(figsize=(15,8),color='crimson').invert_yaxis()
plt.title("Top 10 negative review apps")
plt.xlabel('Total number of negative reviews')
plt.legend()
Out[488]:
<matplotlib.legend.Legend at 0x2eb4f6eca90>

4). Histogram of Subjectivity¶

In [489]:
merged_df.Sentiment_Subjectivity.value_counts()
Out[489]:
0.000000    4134
1.000000    1653
0.500000    1579
0.600000    1133
0.750000    1095
            ... 
0.508052       1
0.454167       1
0.417316       1
0.765000       1
0.545714       1
Name: Sentiment_Subjectivity, Length: 4382, dtype: int64
In [490]:
plt.figure(figsize=(18,9))
plt.xlabel("Subjectivity")
plt.title("Distribution of Subjectivity")
plt.hist(merged_df[merged_df['Sentiment_Subjectivity'].notnull()]['Sentiment_Subjectivity'])
plt.show()

Findings:

0 - objective (fact), 1 - subjective (opinion)

Most sentiment subjectivity values lie between 0.4 and 0.7. From this we can conclude that most users write their reviews based on their own experience with the app.

5). Is sentiment_subjectivity proportional to sentiment_polarity?¶

In [491]:
# scatterplot of sentiment polarity and sentiment subjectivity
plt.figure(figsize=(15, 10))
# Keyword arguments are used because newer seaborn versions no longer accept x and y positionally
sns.scatterplot(data=ur_df, x='Sentiment_Subjectivity', y='Sentiment_Polarity',
                hue='Sentiment', edgecolor='white', palette="inferno")
plt.title("Google Play Store Reviews Sentiment Analysis", fontsize=20)
plt.show()

From the above scatter plot we can conclude that sentiment subjectivity is not always proportional to sentiment polarity, but in most cases it behaves proportionally: reviews with strongly positive or strongly negative polarity tend to be more subjective.
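
A scatter plot can be backed up with a number. The sketch below computes the Pearson correlation between subjectivity and polarity, once with the signed polarity and once with its magnitude. The scores in `toy_reviews` are made up for illustration, so the printed values say nothing about the real reviews:

```python
import pandas as pd

# Hypothetical review scores; ur_df would supply the real columns.
toy_reviews = pd.DataFrame({
    'Sentiment_Polarity':     [-0.8, -0.5, -0.2, 0.0, 0.25, 0.4, 1.0],
    'Sentiment_Subjectivity': [0.8, 0.5, 0.0, 0.0, 0.29, 0.875, 0.53],
})

# Pearson correlation with the signed polarity ...
signed_corr = toy_reviews['Sentiment_Polarity'].corr(
    toy_reviews['Sentiment_Subjectivity'])

# ... and with the polarity magnitude, which ignores whether a review is
# positive or negative and only asks how strong the sentiment is.
strength_corr = toy_reviews['Sentiment_Polarity'].abs().corr(
    toy_reviews['Sentiment_Subjectivity'])

print(round(signed_corr, 3), round(strength_corr, 3))
```

If strong opinions in both directions go along with high subjectivity, the second number should come out clearly higher than the first, because positive and negative reviews cancel each other out in the signed version.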

How Content Rating Affects the Apps¶

1.) Paid App Content Rating¶

In [492]:
paid_df['Content Rating'].value_counts().plot.bar(figsize=(10,10),color='c')
plt.legend()
Out[492]:
<matplotlib.legend.Legend at 0x2eb5220e670>

2.) Free App content Rating¶

In [493]:
free_df['Content Rating'].value_counts().plot.bar(figsize=(10,10),color='blue')
plt.legend()
Out[493]:
<matplotlib.legend.Legend at 0x2eb50c18430>

Most apps on the Google Play Store, both paid and free, have the content rating Everyone. The remaining apps have various age restrictions.

3.) Does the Last Updated date have an effect on the rating?¶

In [494]:
print(ps_df['Last Updated'].head())
#fetch update year from date ('Last Updated' is already a datetime64 column)
ps_df["Update year"] = ps_df["Last Updated"].dt.year
0   2018-01-07
1   2018-01-15
2   2018-08-01
3   2018-06-08
4   2018-06-20
Name: Last Updated, dtype: datetime64[ns]
In [495]:
fig, ax = plt.subplots(figsize=(12,6))
sns.regplot(x="Update year", y="Rating", data=ps_df)
plt.title("Update Year VS Rating")
Out[495]:
Text(0.5, 1.0, 'Update Year VS Rating')

4.) Distribution of App update over the Year¶

In [496]:
# .dt.year reads the year straight from the datetime column
paid_df["Update year"] = paid_df["Last Updated"].dt.year
free_df["Update year"] = free_df["Last Updated"].dt.year
In [497]:
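# label= and legend=True make the two lines distinguishable
paid_df.groupby("Update year")["App"].count().plot.line(marker='o', label='Paid', legend=True)
free_df.groupby('Update year')['App'].count().plot.line(marker='o', label='Free', legend=True)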
Out[497]:
<AxesSubplot:xlabel='Update year'>

The above plot compares the number of Free and Paid apps by the year in which they were added or last updated. There were no paid apps before 2011, and over the years more free apps than paid apps have been added. Comparing the apps updated or added in 2011 and 2018, the share of free apps increased from 80% to 96% while the share of paid apps fell from 20% to 4%. So we can conclude that most people are after free apps.
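
The yearly shares quoted above can be computed directly with a cross-tabulation. Here is a minimal sketch on made-up rows. `toy_apps` and its counts are assumptions for illustration, and the column names follow the ones used in this notebook:

```python
import pandas as pd

# Hypothetical apps: 5 updated in 2011 (4 free, 1 paid), 10 in 2018 (9 free, 1 paid).
toy_apps = pd.DataFrame({
    'Update year': [2011] * 5 + [2018] * 10,
    'Type': ['Free'] * 4 + ['Paid'] + ['Free'] * 9 + ['Paid'],
})

# normalize='index' divides each row by its total, giving the per-year shares.
share_by_year = pd.crosstab(toy_apps['Update year'], toy_apps['Type'],
                            normalize='index') * 100
print(share_by_year.round(1))
```

Each row of the result adds up to 100, so the Free and Paid columns can be compared year by year.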

5.) Distribution of Paid and Free app updates over the months¶

In [498]:
# .dt.month reads the month number (1-12) straight from the datetime column
paid_df["Update month"] = paid_df["Last Updated"].dt.month
free_df["Update month"] = free_df["Last Updated"].dt.month
In [499]:
paid_df.groupby("Update month")["App"].count().plot.bar(figsize=(10,8), color= "green")
plt.title("Paid Apps update over the month", size=20)
plt.legend()
Out[499]:
<matplotlib.legend.Legend at 0x2eb4e8cf160>

Most of the paid apps were also updated in the month of July, the same as the free apps.

In [500]:
free_df.groupby("Update month")["App"].count().plot.bar(figsize=(10,8), color='purple')
plt.title("Free Apps update over the month", size=20)
plt.legend()
Out[500]:
<matplotlib.legend.Legend at 0x2eb47440730>

In this data almost 50% of the apps were added or updated in the month of July, 25% in the month of August, and the remaining 25% in the other months.

▶Analysis Summary¶

In this project of analyzing Play Store applications, we have worked on several parameters which would help AlmaBetter to do well in launching their apps on the Play Store.

In the initial phase, we focused more on the problem statements and data cleaning, in order to ensure that we give them the best results out of our analysis.

AlmaBetter needs to focus more on:

  1. Developing apps in the least crowded categories, such as Events and Beauty, as they are not explored much.
  2. Most of the apps are free, so focusing on free apps is more important.
  3. Focusing more on content available for Everyone will increase the chances of getting the highest installs.
  4. They need to focus on updating their apps regularly, so that they attract more users.
  5. They need to keep in mind that the sentiments of users keep varying as they keep using the app, so they should focus more on users' needs and features.
  • Percentage of free apps = ~92%
  • Percentage of apps with no age restrictions = ~82%
  • Most competitive category: Family
  • Category with the highest total app installs: Game
  • Percentage of apps that are top rated = ~80%
  • Family, Game and Tools are the top three categories, with 1829, 959 and 825 apps respectively.
  • Tools, Entertainment, Education, Business and Medical are the top genres.
  • 8783 apps are smaller than 50 MB. 7749 apps have a rating above 4.0, counting both free and paid apps.
  • There are 20 free apps that have been installed over a billion times.
  • Minecraft is the only app in the paid category with over 10M installs. This app has also produced the most revenue from the installation fee alone.
  • Category in which the paid apps have the highest average installation fee: Finance
  • The median size of all apps in the play store is 12 MB.
  • The apps whose size varies with the device have the highest average number of app installs.
  • The apps larger than 90 MB have the highest average number of user reviews, i.e., they are more popular than the rest.
  • Helix Jump has the highest number of positive reviews and Angry Birds Classic has the highest number of negative reviews.
  • Overall sentiment share in the merged dataset: Positive 64%, Negative 22% and Neutral 13%.

1. Rating

Most of the apps have a rating between 4 and 5.

The most common rating is 4.3.

The app categories have an average rating of more than 4.

2. Size

Most of the applications in the dataset are small in size.

3. Installs

The largest number of apps fall into three categories: Family, Game and Tools.

Most of the apps in the Google Play Store come under Family, Game and Tools, but the picture is not the same when we look at installs. The most installed apps come under Game, Communication, Productivity and Social.

Subway Surfers, Facebook, Messenger and Google Drive are the most installed apps.

4. Type (Free/Paid)

About 92% of the apps are free and 8% are paid.

The category ‘Family’ has the highest number of paid apps.

Free apps are installed more than paid apps.

The app “I'm Rich - Trump Edition” from the category ‘Lifestyle’ is the most expensive app, priced at $400.

5. Content Rating

Apps rated Everyone have the most installs, while Unrated and Adults only 18+ apps have the fewest.

6. Reviews

The number of installs is positively correlated with the number of reviews (correlation 0.64).

Sentiment Analysis

7. Sentiment

Most of the reviews have Positive sentiment, while Negative and Neutral reviews are far fewer.

8. Sentiment Polarity / Sentiment Subjectivity

The reviews show a wide range of subjectivity, and most of them fall in the [-0.50, 0.75] polarity range, implying that extremely negative or extremely positive sentiments are rare. Most of the reviews show a mid-range of negative and positive sentiments.

Sentiment subjectivity is not always proportional to sentiment polarity, but in most cases it behaves proportionally: reviews with strongly positive or strongly negative polarity tend to be more subjective.

Sentiment Polarity is not highly correlated with Sentiment Subjectivity.

Challenges & Future Work¶

  1. Our major challenge was data cleaning.
  2. 13.60% of reviews were NaN values, and even after merging both the dataframes, we could not infer much in order to fill them. Thus we had to drop them.
  3. The merged dataframe of the play store and user reviews data had only 816 common apps. This is just 10% of the cleaned data. We could have given a more valuable analysis if at least 70% - 80% of the data had been available in the merged dataframe.
  4. User Reviews had 42% NaN values, which could otherwise have been used for developing an understanding of the category-wise sentiments and would have helped us to fill the 13.60% NaN values of the Reviews column.
  5. There is much more that can be explored. For example, the current version and Android version columns can be explored in detail to show how they affect an app and what needs to be kept in mind while developing apps for users.
  6. We can explore how the size of the app and the version of Android relate to the number of installs.
  7. Machine learning can help us to derive more insights by developing models that interpret the data even better. We have left this as future work.
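
The overlap between the two files mentioned in the challenges can be measured before merging. Below is a small sketch of that check. The two Series stand in for `ps_df['App']` and `ur_df['App']`, and 'Some Other App' is a made-up name:

```python
import pandas as pd

# Hypothetical app names standing in for ps_df['App'] and ur_df['App'].
store_apps = pd.Series(['Gmail', 'Hangouts', 'Minecraft', 'Facebook', 'Helix Jump'])
review_apps = pd.Series(['Helix Jump', 'Helix Jump', 'Facebook', 'Some Other App'])

# Apps that appear in both files, and their share of the unique store apps.
common_apps = set(store_apps) & set(review_apps)
coverage = len(common_apps) / store_apps.nunique() * 100
print(sorted(common_apps), f'{coverage:.1f}%')
```

A low coverage value at this stage is an early warning that findings from the merged dataframe only describe a small part of the store data.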